Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style; it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with R by coding along with the instructor. Other teaching materials include an HTML version of the notebook, and datasets to import into R when required. This learning approach will allow you to spend the time coding and not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
We'll take a blank slate approach here to R and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to get you from some potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables all in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.
The structure of the class is a code-along style: It is fully hands on. At the end of each lecture, the complete notes will be made available in a PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.
This is the second in a series of seven lectures. Last lecture we discussed the basic functions and structures of R as well as how to navigate them. This week we will focus more on the data.frame object and learning how to manipulate the information it holds.
At the end of this session you will be familiar with importing data from plain text and excel files; filtering, sorting, and re-arranging your data.frames using the dplyr package; the concept of piping command calls; and writing your resulting data to files. Our topics are broken into:
dplyr package to filter, subset and manipulate your data and to perform simple calculations.
Grey background: Command-line code, R library and function names
... fill in the code here if you are coding along
Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.
An Excel workbook that will be used to show how we can import even entire Excel workbooks into R.
This dataset is the result of 16S rRNA gene amplicon sequencing of samples from microbial communities cultured in fresh, brackish, or saline media. Treatments received the aromatic compounds toluene or pyrene as the sole source of carbon and energy. Controls did not receive any compounds (substrate-free) to account for any alternative carbon sources present in the media. The objective of this experiment was to evaluate which microorganisms would make use of toluene and pyrene.
We will use the microbes.csv dataset to learn how to manipulate our data using dplyr.
Packages are groups of related functions that serve a purpose. They can be a series of functions to help analyse specific data or they could be a group of functions used to simplify the process of formatting your data (more on that later in this lecture!).
Depending on their structure they may also rely on other packages.
There are a few different places you can install R packages from. Listed in order of decreasing trustworthiness:
CRAN (The Comprehensive R Archive Network)
Bioconductor (Bioinformatics/Genomics focus)
GitHub
Joe's website
Regardless of where you download a package from, it's a good idea to document that installation, especially if you had to troubleshoot that installation (you'll eventually be there, I promise!)
devtools is a package that is used by developers to make R packages, but it also helps us to install packages from GitHub. It is downloaded from CRAN.
Installing packages through your JupyterHub notebook is relatively straightforward but any packages you install only remain during your current instance (login) of the hub. Whenever you log out from the JupyterHub, these installed libraries will essentially vaporize.
The install.packages() command will work just as it should in R and RStudio. Find instructions in the Appendix section of Lecture 01 for installation of packages into your own personal Anaconda-based installation of Jupyter Notebook.
install.packages('devtools') # Comment this line out once the package is installed so it does not re-run
Installing package into 'C:/Users/mokca/Documents/R/win-library/4.0'
(as 'lib' is unspecified)
also installing the dependencies 'credentials', 'diffobj', 'gert', 'cachem', 'waldo', 'usethis', 'desc', 'memoise', 'pkgbuild', 'pkgload', 'remotes', 'testthat', 'withr'
package 'credentials' successfully unpacked and MD5 sums checked
package 'diffobj' successfully unpacked and MD5 sums checked
package 'gert' successfully unpacked and MD5 sums checked
package 'cachem' successfully unpacked and MD5 sums checked
package 'waldo' successfully unpacked and MD5 sums checked
package 'usethis' successfully unpacked and MD5 sums checked
package 'desc' successfully unpacked and MD5 sums checked
package 'memoise' successfully unpacked and MD5 sums checked
package 'pkgbuild' successfully unpacked and MD5 sums checked
package 'pkgload' successfully unpacked and MD5 sums checked
package 'remotes' successfully unpacked and MD5 sums checked
package 'testthat' successfully unpacked and MD5 sums checked
package 'withr' successfully unpacked and MD5 sums checked
package 'devtools' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\mokca\AppData\Local\Temp\RtmpgZIeZt\downloaded_packages
R may give you package installation warnings. Don't panic. In general, your package will either be installed and R will test if the installed package can be loaded, or R will give you a non-zero exit status - which means your package was not installed. If you read the entire error message, it will give you a hint as to why the package did not install.
Some packages depend on previously developed packages and can only be installed after another package is installed in your library. Similarly, that previous package may depend on another package and so on. To solve this potential issue we use the dependencies logical parameter in our call.
install.packages('devtools', dependencies = TRUE)
# remove.packages("devtools") # Uninstall any CRAN package
library() to load your packages after installation
A package only has to be installed once. It is now in your library. To use a package, you must load the package into memory. Unless this is one of the packages R loads automatically, you choose which packages to load every session.
library() takes a single argument. library() will throw an error if you try to load a package that is not installed. You may see require() on help pages, which also loads packages. It is usually used inside functions (it gives a warning instead of an error if a package is not installed).
library(devtools)
# or
#library('devtools')
Warning message:
"package 'devtools' was built under R version 4.0.5"
Loading required package: usethis
Warning message:
"package 'usethis' was built under R version 4.0.5"
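The difference between library() and require() described above can be seen by asking for a package that does not exist. This is only a minimal sketch: the package name notARealPackage123 is made up for illustration and is assumed not to be installed on your system.

```r
# Hypothetical package name, assumed NOT to be installed anywhere
fake_pkg <- "notARealPackage123"

# library() stops with an error; tryCatch() captures it so the script keeps going
library_result <- tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

# require() returns FALSE (with a warning) instead of stopping
require_result <- suppressWarnings(require(fake_pkg, character.only = TRUE))

print(library_result)
print(require_result)
```

Because require() hands back TRUE or FALSE, it can be used inside an if() statement, which is one reason you tend to see it inside functions.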
BiocManager()
To install from Bioconductor, you can use the BiocManager package to pull down and install packages from the Bioconductor repository.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager") # this piece of code checks if BiocManager is installed.
# If it is not installed, it will install it for you. It does nothing if BiocManager is already installed.
BiocManager::install("GenomicRanges")
#or
#BiocManager::install(c("GenomicRanges", "ConnectivityMap"))
package::function()
As mentioned above, devtools is required to install from GitHub. We don't actually need to load the entire library for devtools if we are only going to use one function. We select a function using the syntax package::function().
devtools::install_github("tidyverse/googlesheets4")
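The same package::function() syntax works for any installed package. As a small sketch that needs no installation, the example below uses the tools package, which ships with R but is not attached by default; the file path is simply the workbook name used in this lecture and the file does not need to exist for the call to work.

```r
# tools is installed with R but is not attached in a fresh session,
# so we reach into it with package::function() instead of calling library(tools)
extension <- tools::file_ext("data/miscellaneous.xlsx")

print(extension)
```

Only the one function is looked up; the rest of the package stays out of your search path.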
All packages are loaded the same regardless of their origin, using library().
# Load googlesheets4 now from the library
library(googlesheets4)
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2)
writexl used for writing multiple datasets to Excel files
#--------- Install packages for today's session ----------#
#install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on Jupyter Hub
install.packages("writexl", dependencies = TRUE) # This package is NOT already installed on Jupyter Hub
#--------- Load packages for today's session ----------#
library(tidyverse)
# readxl, used for reading xlsx files, is not a core component of the tidyverse
library(readxl)
library(writexl)
Installing package into 'C:/Users/mokca/Documents/R/win-library/4.0'
(as 'lib' is unspecified)
package 'writexl' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\mokca\AppData\Local\Temp\RtmpgZIeZt\downloaded_packages
Warning message:
"package 'tidyverse' was built under R version 4.0.5"
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.3     v purrr   0.3.4
v tibble  3.1.1     v dplyr   1.0.6
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
Warning message:
"package 'ggplot2' was built under R version 4.0.5"
Warning message:
"package 'tibble' was built under R version 4.0.5"
Warning message:
"package 'tidyr' was built under R version 4.0.5"
Warning message:
"package 'dplyr' was built under R version 4.0.5"
Warning message:
"package 'forcats' was built under R version 4.0.5"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning message:
"package 'writexl' was built under R version 4.0.5"
Jupyter Notebooks generally do a good job of installing packages but if you want a little more control over the process, you can do so via Anaconda. Open up the Anaconda prompt and install your packages of interest.
conda install - starting command to call the installer for anaconda.
-c conda-forge - look for the package in the 'conda-forge' channel.
r-packagename - the name of the package you're interested in installing.
Combine the parts into a single command like:
conda install -c conda-forge r-essentials # This will install tidyverse along with the other dependencies
conda install -c conda-forge r-googlesheets4
The most important thing when starting to work with your data is to know how to load it into the memory of the R kernel. There are a number of ways to read in files and each is suited to dealing with specific file types or file sizes, or may perform better depending on how you wish to read/store the file (all at once, a line at a time, or somewhere in between!).
tibble with read_csv()
The tidyverse package has its own function for reading in text files because the tibble structure was first developed as part of the dplyr package! If we'll be spending our time working with the tidyverse then we may as well use their commands for importing files! If you want to learn how to do this with the base R utils package, check out the Appendix section for details.
Let's look quickly at the read_csv() function which is a specific version of the read_delim() function from the readr package. The arguments we are interested in are:
file: The path to the file you want to import
col_names: TRUE (there is a header), FALSE, or a character vector of custom names you want to use for your data columns
col_types: NULL (default), in which case the column types are guessed from the data, or a cols() specification of the data type for each column. Find more information in the ?read_csv details.
na: a character vector of strings to interpret as NA values. Very handy when you have values you want to identify and convert at import.
From this point on, we'll pretty much use the terms tibble and data.frame interchangeably.
# ?read_csv
# Import our microbes.csv file from the data folder
microbes <- read_csv(file = "data/microbes.csv",
                     col_names = TRUE,
                     col_types = cols() # Producing a blank cols() specification suppresses the normal parsing output
                     )
# Check out the structure of our table
str(microbes)
spec_tbl_df[,12] [6,656 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ abundance: num [1:6656] 40.69 11.71 11.13 6.14 3.97 ...
 $ compound : chr [1:6656] "compound-free" "compound-free" "compound-free" "compound-free" ...
 $ salinity : chr [1:6656] "brackish" "brackish" "brackish" "brackish" ...
 $ group    : chr [1:6656] "control" "control" "control" "control" ...
 $ replicate: num [1:6656] 1 1 1 1 1 1 1 1 1 1 ...
 $ kingdom  : chr [1:6656] "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
 $ phylum   : chr [1:6656] "Proteobacteria" "Proteobacteria" "Firmicutes" "Firmicutes" ...
 $ class    : chr [1:6656] "Gammaproteobacteria" "Alphaproteobacteria" "Clostridia" "Bacilli" ...
 $ order    : chr [1:6656] "Pseudomonadales" "Rhodospirillales" "Clostridiales" "Lactobacillales" ...
 $ family   : chr [1:6656] "Pseudomonadaceae" "Rhodospirillaceae" "Lachnospiraceae" "Carnobacteriaceae" ...
 $ genus    : chr [1:6656] "Pseudomonas" "Candidatus" "Lachnoclostridium" "Trichococcus" ...
 $ ASV      : chr [1:6656] "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGC"| __truncated__ "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCT"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCC"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCC"| __truncated__ ...
 - attr(*, "spec")=
  .. cols(
  ..   abundance = col_double(),
  ..   compound = col_character(),
  ..   salinity = col_character(),
  ..   group = col_character(),
  ..   replicate = col_double(),
  ..   kingdom = col_character(),
  ..   phylum = col_character(),
  ..   class = col_character(),
  ..   order = col_character(),
  ..   family = col_character(),
  ..   genus = col_character(),
  ..   ASV = col_character()
  .. )
As you can see, it's a pretty smooth process to parse simple text files. We'll learn some additional functions as we become familiar with the tidyverse package as well.
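To see what the col_types and na arguments do in practice, here is a minimal sketch that does not touch the lecture data. The file contents, the column choices, and the strings "missing" and "-" are all made up for illustration; the example writes a tiny CSV to a temporary file and reads it back.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file (values are illustrative only)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("abundance,salinity,replicate",
             "40.69,brackish,1",
             "missing,fresh,2",
             "11.13,-,3"),
           tmp_csv)

# Declare each column type explicitly and list the strings that should become NA
toy <- read_csv(file = tmp_csv,
                col_names = TRUE,
                col_types = cols(abundance = col_double(),
                                 salinity  = col_character(),
                                 replicate = col_integer()),
                na = c("", "NA", "missing", "-"))

str(toy)
```

Declaring the types yourself means a stray text value cannot silently turn a numeric column into character, and the na vector converts placeholder strings to NA at import rather than in a later cleaning step.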
readxl package
What happens if we have an Excel file? The readxl package, which is installed as part of the tidyverse package, will recognize both xls and xlsx files. It expects tabular data.
Note that while we have already loaded the tidyverse package, we will need to explicitly load readxl so we can use the read_excel() function to accomplish our task. Some parameters we are interested in are:
path: The path to the file you want to import.
sheet: The sheet you want to read either as a string (sheet name) or integer (position).
col_names: TRUE (there is a header), FALSE, or a character vector of custom names you want to use for your data columns.
col_types: NULL (default), in which case the column types are guessed from the spreadsheet, or a character vector containing one entry per column from "skip", "guess", "logical", "numeric", "date", "text", or "list".
na: a character vector of strings to interpret as NA values. Very handy when you have values you want to identify and convert at import.
range: a way to specify a rectangular area to take data from your Excel file.
First, let's try to open our Excel file with read_csv().
# read_csv() doesn't work for excel files
head(read_csv("data/miscellaneous.xlsx"))
! Multiple files in zip: reading ''[Content_Types].xml''
-- Column specification --------------------------------------------------------
cols(
  `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>` = col_character()
)
| <?xml version="1.0" encoding="UTF-8" standalone="yes"?> |
|---|
| <chr> |
| <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"><Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/><Default Extension="xml" ContentType="application/xml"/><Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/><Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/><Override PartName="/xl/worksheets/sheet2.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/><Override PartName="/xl/worksheets/sheet3.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/><Override PartName="/xl/worksheets/sheet4.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/><Override PartName="/xl/worksheets/sheet5.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/><Override PartName="/xl/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/><Override PartName="/xl/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml"/><Override PartName="/xl/sharedStrings.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml"/><Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/><Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/></Types> |
Looks like it didn't work...
# The readxl package is not a core component of the tidyverse so we need to load it
library(readxl) # Note that we've already loaded it in section 1.5.0
# let's take a peek at the miscellaneous.xlsx sheet
head(read_excel("data/miscellaneous.xlsx"))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.9 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
excel_sheets()
Why doesn't our output look like a workbook? The read_excel() function defaults to reading in the first worksheet. You can specify which sheet you want to read in by position or name. Let's see what the names of our sheets are using the excel_sheets() function.
The excel_sheets() function returns a character vector as output.
# grab the excel sheet names
excel_sheets("data/miscellaneous.xlsx")
read_excel()
If we want to get fancy, it is possible to subset from a sheet by specifying cell numbers or ranges. Here we are grabbing sheet 1 (microbes), and subsetting cells over a range defined by two cells - A2:D9.
For our purposes, the read_excel() function takes the form of read_excel(path, sheet = NULL, range = NULL) but there are additional parameters we can supply to the function. See ?read_excel for more information.
# read in a specific sheet and range with read_excel()
read_excel(path = "data/miscellaneous.xlsx",
           sheet = 1, range = "A2:D9")
| 40.69 | compound-free | brackish | control |
|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> |
| 11.71 | compound-free | brackish | control |
| 11.13 | compound-free | brackish | control |
| 6.14 | compound-free | brackish | control |
| 3.97 | compound-free | brackish | control |
| 3.90 | compound-free | brackish | control |
| 2.88 | compound-free | brackish | control |
| 2.54 | compound-free | brackish | control |
We could alternatively specify the sheet by name. This is how you would simply grab rows.
# read in an Excel file by a specific row range
read_excel("data/miscellaneous.xlsx", sheet = "microbes", range = cell_rows(1:9))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
| 2.88 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAATTGCAGAGGAACATAGTTGAAAGATTATGGCCGCAAGGTCTCTGTGAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTTATCATTAGTTACTAACAGGTCATGCTGAGGACTCTAGTGAGACTGCCGTCGTAAGATGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCTACACACGTGTTACAATGGGGGGTACAGAGGGCAGCTACCGGGCGACCGGATGCCAATCCCAAAAACCTCTCTCAGTTCGGATCGAAGTCTGCAACCCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 2.54 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Shewanellaceae | Shewanella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCTCAGAAGCCAGCGGAGACGCAGGTGTGCCTTCGGGAACTGAGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATCCTTACTTGCCAGCGGGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCGGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCTCATAAAGCCGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACA |
Note that if your first row is the header, excluding this row will result in data filling in the header unless you include the parameter col_names = FALSE.
Likewise, how would you subset just columns from the same sheet?
# read in an Excel file by a specific column range
head(read_excel("data/miscellaneous.xlsx",
                sheet = "microbes", range = cell_cols("B:D")))
| compound | salinity | group |
|---|---|---|
| <chr> | <chr> | <chr> |
| compound-free | brackish | control |
| compound-free | brackish | control |
| compound-free | brackish | control |
| compound-free | brackish | control |
| compound-free | brackish | control |
| compound-free | brackish | control |
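To try both kinds of subsetting without the course files, here is a minimal sketch that assumes the readxl package is installed and uses the datasets.xlsx example workbook that ships with it (located with readxl_example()). The object names are only for illustration.

```r
library(readxl)

# Path to an example workbook bundled with readxl (not part of the course data)
example_path <- readxl_example("datasets.xlsx")

# Columns B to D only: the header row is still inside the range, so names are kept
iris_cols <- read_excel(example_path, sheet = "iris", range = cell_cols("B:D"))

# Rows 102 to 151 only: the header row (row 1) is excluded from the range,
# so col_names = FALSE stops the first data row from being used as a header
iris_rows <- read_excel(example_path, sheet = "iris",
                        range = cell_rows(102:151), col_names = FALSE)

dim(iris_cols)
dim(iris_rows)
```

When col_names = FALSE is used, readxl generates placeholder column names, so you would normally rename the columns afterwards.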
lapply() is the list version of apply()
How would we read in all of the sheets at once? One solution is to use lapply(), a version of the apply() function. lapply() returns a list object of the same length as X, in which each element is the result of applying FUN to the corresponding element of X. Note that the elements of the returned list could be any kind of object!
We can use lapply() so that each sheet will be stored as a data.frame inside of a list object. Recall that apply() took in a matrix, a row/column specification (MARGIN), and a function.
lapply(), instead, drops the MARGIN parameter and takes in a vector or a list, a function, and any additional arguments for the function. Remember that lists are a single dimension and thus do not have a row/column configuration.
So far we have been accustomed to functions finding our variables globally (in the global environment). lapply(), however, is looking locally (within the function), so we need to explicitly provide our path. We will get more into local vs. global variables in our control flow lesson (lecture 07). For now, just know we can read in all worksheets from an Excel workbook.
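Before applying this to a workbook, it may help to see lapply() on a toy list. The sketch below uses only base R; the list and its values are made up for illustration.

```r
# A small list whose elements have different lengths
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# Apply mean() to each element; the result is a list of the same length as X
element_means <- lapply(X = measurements, FUN = mean)

# Additional arguments after FUN are passed on to FUN for every element
element_quantiles <- lapply(X = measurements, FUN = quantile, probs = 0.5)

str(element_means)
length(element_means) == length(measurements)
```

Note that there is no MARGIN argument anywhere: lapply() simply walks along the elements of X.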
# Use lapply and provide a list of excel sheet names, then apply a function to each element (Sheet name) of the list!
excel_sheets_list <- lapply(X = excel_sheets("data/miscellaneous.xlsx"), # this will set X to a character vector
FUN = read_excel,
path = "data/miscellaneous.xlsx" # This is an argument for read_excel()
)
# What is the structure of our sheets_list?
str(excel_sheets_list)
List of 5 $ : tibble[,12] [6,656 x 12] (S3: tbl_df/tbl/data.frame) ..$ abundance: chr [1:6656] "40.69" "11.71" "11.13" "6.14" ... ..$ compound : chr [1:6656] "compound-free" "compound-free" "compound-free" "compound-free" ... ..$ salinity : chr [1:6656] "brackish" "brackish" "brackish" "brackish" ... ..$ group : chr [1:6656] "control" "control" "control" "control" ... ..$ replicate: num [1:6656] 1 1 1 1 1 1 1 1 1 1 ... ..$ kingdom : chr [1:6656] "Bacteria" "Bacteria" "Bacteria" "Bacteria" ... ..$ phylum : chr [1:6656] "Proteobacteria" "Proteobacteria" "Firmicutes" "Firmicutes" ... ..$ class : chr [1:6656] "Gammaproteobacteria" "Alphaproteobacteria" "Clostridia" "Bacilli" ... ..$ order : chr [1:6656] "Pseudomonadales" "Rhodospirillales" "Clostridiales" "Lactobacillales" ... ..$ family : chr [1:6656] "Pseudomonadaceae" "Rhodospirillaceae" "Lachnospiraceae" "Carnobacteriaceae" ... ..$ genus : chr [1:6656] "Pseudomonas" "Candidatus" "Lachnoclostridium" "Trichococcus" ... ..$ ASV : chr [1:6656] "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGC"| __truncated__ "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCT"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCC"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCC"| __truncated__ ... $ : tibble[,16] [999 x 16] (S3: tbl_df/tbl/data.frame) ..$ UK's most borrowed library books : chr [1:999] "July 2009-June 2010" "206 bones" "7th heaven" "7th heaven" ... ..$ Desert Island Discs book choices : chr [1:999] "(Feb 2008-Feb 2011)" "Anna Karenina" "Blake" "Breakfast of Champions" ... ..$ Pulitzer Prize winners (Fiction 1948-, Novel pre-1948) : chr [1:999] "(1918-2010)" "A Bell for Adano" "A Confederacy of Dunces" "A Death in the Family" ... 
..$ Askmetafilter.com Books Everyone Should Read : chr [1:999] "http://ask.metafilter.com/42616/A-book-everyone-should-read [accessed 23 Feb 2011]" "1984" "Aesop's Fables" "Against the Grain" ... ..$ LibraryThing.com (top 50) : chr [1:999] "http://www.librarything.com/z_books.php [accessed 23 Feb 2011]" "1984" "American Gods" "Angels and Demons" ... ..$ World Book Day Poll (top 100) : chr [1:999] "Latest relevant poll year is 2007: http://www.guardian.co.uk/books/2007/mar/01/news (Books you can't live witho"| __truncated__ "1984" "A Christmas Carol" "A Confederacy of Dunces" ... ..$ Telegraph 100 Novels Everyone Should Read : chr [1:999] "(2009) http://www.telegraph.co.uk/culture/books/4248401/100-novels-everyone-should-read.html [accessed 24 Feb 2011]" "1984" "A Bend in the River" "A Dance to the Music of Time" ... ..$ Goodreads.com Books That Everyone Should Read At Least Once (top 100): chr [1:999] "(Created July 11, 2008, ongoing) http://www.goodreads.com/list/show/264.Books_that_everyone_should_read_at_leas"| __truncated__ "1984" "A Christmas Carol" "A Clockwork Orange" ... ..$ Bspcn.com 30 Books Everyone Should Read Before They're 30 : chr [1:999] "(2010) http://www.bspcn.com/2010/08/03/30-books-everyone-should-read-before-they’re-thirty/ [accessed 24 Feb 2011]" "1984" "A Clockwork Orange" "Catch-22" ... ..$ Guardian 1000 Novels Everyone Must Read : chr [1:999] "(2009) http://www.guardian.co.uk/books/2009/jan/23/bestbooks-fiction [accessed 24 Feb 2011]" "1974" "1977" "1984" ... ..$ Bighow.com 100 Greatest Books of All Time Everyone Must Read : chr [1:999] "(2010) http://bighow.com/news/the-100-greatest-books-of-all-time-everyone-must-read [accessed 24 Feb 2011]" "1984" "A Clockwork Orange" "A Confederacy of Dunces" ... ..$ The Best 100 Lists Top 100 Novels of All Time : chr [1:999] "http://www.thebest100lists.com/best100novels/ [accessed 24 Feb 2011, last updated 2 Feb 2011]" "1984" "A Clockwork Orange" "A Confederacy of Dunces" ... 
..$ Man Booker Prize winners : chr [1:999] "(1969-2010)" "Amsterdam" "Disgrace" "G." ... ..$ Oprah's Book Club List : chr [1:999] "http://www.oprah.com/oprahsbookclub/Complete-List-of-Oprahs-Book-Club-Books [updated Sept 17 2010, accessed 24 Feb 2011]" "A Million Little Pieces" "A New Earth" "Anna Karenina" ... ..$ 1001 Books You Should Read Before You Die (Cassell, 2005) : chr [1:999] "we didn't include this" NA NA NA ... ..$ Author's own top five... : logi [1:999] NA NA NA NA NA NA ... $ : tibble[,2] [2,003 x 2] (S3: tbl_df/tbl/data.frame) ..$ Title : chr [1:2003] "1974" "1977" "1984" "1984" ... ..$ No of mentions: num [1:2003] NA NA NA NA NA NA NA NA NA NA ... $ : tibble[,2] [246 x 2] (S3: tbl_df/tbl/data.frame) ..$ Title : chr [1:246] "To Kill a Mockingbird" "1984" "Catch-22" "Crime and Punishment" ... ..$ No of mentions: num [1:246] 11 9 9 9 9 9 9 9 8 8 ... $ : tibble[,1] [0 x 1] (S3: tbl_df/tbl/data.frame) ..$ add to Books Everyone Should Read http://www.brainpickings.org/index.php/2012/01/30/writers-top-ten-favorite-books/?fb_action_ids=10203689181294105&fb_action_types=og.likes: logi(0)
lapply()
Remember the parameters of
`read_excel(path, sheet = NULL, range = NULL)`
Notice that the second positional parameter is sheet. In our lapply() call we didn't specifically name that parameter! Recall we used
`lapply(X= excel_sheets("data/miscellaneous.xlsx"), FUN = read_excel, path = "data/miscellaneous.xlsx")`
and thus explicitly named our first parameter path. The next available parameter by default order was sheet to which the elements of X were applied. We now have a list object with each worksheet being one item in the list.
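The argument matching described above can be sketched with a stand-in function. Here describe_sheet() is hypothetical: it only mimics the leading parameters of read_excel() and returns a string instead of reading a file, and "workbook.xlsx" and the sheet name "books" are placeholders.

```r
# A hypothetical stand-in with the same leading parameters as read_excel()
describe_sheet <- function(path, sheet = NULL, range = NULL) {
  paste0("path = ", path, "; sheet = ", sheet)
}

sheet_names <- c("microbes", "books")

# path is supplied by name, so each element of X fills the next unmatched
# parameter in order, which is sheet
by_position <- lapply(X = sheet_names, FUN = describe_sheet, path = "workbook.xlsx")

# Equivalent call that names sheet explicitly through an anonymous function
by_name <- lapply(X = sheet_names,
                  FUN = function(s) describe_sheet(path = "workbook.xlsx", sheet = s))

identical(by_position, by_name)
```

The second form is longer but makes it obvious which parameter receives each element of X, which can be easier to read when a function has many arguments.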
You can subset the data.frame you would like to work with using the syntax list[[x]] and store it as a variable using data.frame().
# You can see the structure of our first list element.
# Remember the difference between [[]] and []?
str(excel_sheets_list[[1]])
tibble[,12] [6,656 x 12] (S3: tbl_df/tbl/data.frame) $ abundance: chr [1:6656] "40.69" "11.71" "11.13" "6.14" ... $ compound : chr [1:6656] "compound-free" "compound-free" "compound-free" "compound-free" ... $ salinity : chr [1:6656] "brackish" "brackish" "brackish" "brackish" ... $ group : chr [1:6656] "control" "control" "control" "control" ... $ replicate: num [1:6656] 1 1 1 1 1 1 1 1 1 1 ... $ kingdom : chr [1:6656] "Bacteria" "Bacteria" "Bacteria" "Bacteria" ... $ phylum : chr [1:6656] "Proteobacteria" "Proteobacteria" "Firmicutes" "Firmicutes" ... $ class : chr [1:6656] "Gammaproteobacteria" "Alphaproteobacteria" "Clostridia" "Bacilli" ... $ order : chr [1:6656] "Pseudomonadales" "Rhodospirillales" "Clostridiales" "Lactobacillales" ... $ family : chr [1:6656] "Pseudomonadaceae" "Rhodospirillaceae" "Lachnospiraceae" "Carnobacteriaceae" ... $ genus : chr [1:6656] "Pseudomonas" "Candidatus" "Lachnoclostridium" "Trichococcus" ... $ ASV : chr [1:6656] "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGC"| __truncated__ "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCT"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCC"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCC"| __truncated__ ...
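As a quick reminder of the difference between [] and [[]] on a list, here is a minimal base R sketch using a made-up list of two small data frames.

```r
# A toy list holding two data frames (hypothetical data)
sheets <- list(first  = data.frame(id = 1:3),
               second = data.frame(id = 4:5))

# Single brackets return a list containing the selected element(s)
class(sheets[1])

# Double brackets return the element itself
class(sheets[[1]])

# Only the extracted element can be used directly as a data frame
nrow(sheets[[1]])
```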
tibble is essentially a data.frame
Notice that the object type of our imported sheet isn't exactly a data.frame. Rather it is a tibble which is an extended version of the data.frame. Overall a tibble replicates the same behaviours as a data.frame except when printing/displaying (only output the first 10 rows vs. all) and in how we subset a single column. As long as you use methods from within the tidyverse, this construct will work just fine.
If you'd like to exclusively work with a data.frame, you can cast it using the data.frame() command.
# Pull a single column from our tibble
print("Indexing a column from a tibble is still a tibble")
str(excel_sheets_list[[1]][,1])
# Cast the tibble to a data.frame and then pull a single column
print("Indexing a column from a data.frame becomes a vector")
str(data.frame(excel_sheets_list[[1]])[,1])
[1] "Indexing a column from a tibble is still a tibble" tibble[,1] [6,656 x 1] (S3: tbl_df/tbl/data.frame) $ abundance: chr [1:6656] "40.69" "11.71" "11.13" "6.14" ... [1] "Indexing a column from a data.frame becomes a vector" chr [1:6656] "40.69" "11.71" "11.13" "6.14" "3.97" "3.9" "2.88" "2.54" ...
# Assign our first sheet to its own variable
excel_sheets_list_microbes <- data.frame(excel_sheets_list[[1]])
str(excel_sheets_list_microbes)
'data.frame': 6656 obs. of 12 variables: $ abundance: chr "40.69" "11.71" "11.13" "6.14" ... $ compound : chr "compound-free" "compound-free" "compound-free" "compound-free" ... $ salinity : chr "brackish" "brackish" "brackish" "brackish" ... $ group : chr "control" "control" "control" "control" ... $ replicate: num 1 1 1 1 1 1 1 1 1 1 ... $ kingdom : chr "Bacteria" "Bacteria" "Bacteria" "Bacteria" ... $ phylum : chr "Proteobacteria" "Proteobacteria" "Firmicutes" "Firmicutes" ... $ class : chr "Gammaproteobacteria" "Alphaproteobacteria" "Clostridia" "Bacilli" ... $ order : chr "Pseudomonadales" "Rhodospirillales" "Clostridiales" "Lactobacillales" ... $ family : chr "Pseudomonadaceae" "Rhodospirillaceae" "Lachnospiraceae" "Carnobacteriaceae" ... $ genus : chr "Pseudomonas" "Candidatus" "Lachnoclostridium" "Trichococcus" ... $ ASV : chr "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGC"| __truncated__ "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCT"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCC"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCC"| __truncated__ ...
At this point, you will be able to use your excel worksheet as a normal data.frame in R. Notice above that our abundance column is identified as a character variable? How could we convert that to a numeric or double?
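One possible answer, sketched on a small made-up data frame rather than the real sheet, is to overwrite the column with the result of as.numeric(). The name toy and its values are only for illustration.

```r
# A toy data frame with numbers stored as text, as in our imported sheet
toy <- data.frame(abundance = c("40.69", "11.71", "6.14"),
                  genus = c("Pseudomonas", "Candidatus", "Trichococcus"))

# as.numeric() converts the character values to doubles
toy$abundance <- as.numeric(toy$abundance)

str(toy)

# Text that cannot be parsed as a number becomes NA, with a warning
suppressWarnings(as.numeric(c("3.9", "No_Data")))
```

The last line is a reminder to check a converted column for unexpected NA values, since they usually point to entries that were not really numbers.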
If you are a googlesheets person, there is a package (surprisingly called 'googlesheets4') that will allow you to get your worksheets in and out of R. For more information on googlesheets, check out the tidyverse/googlesheets4 page.
Image courtesy of xkcd
We'll often make assumptions about our datasets, like all of the values for a variable are within a certain range, or all positive. We also usually assume that all of the entries in our data are complete, with no missing values or incorrect categories. This can be a bit of a trap, especially in large datasets where we cannot view it all by eye. Here we'll discuss some helpful tools for inspecting your data before you start using more complex code for it.
When first importing data (especially from outside sources) it is best to inspect it for problems like missing values, inconsistent formatting, special characters, etc. Here, we'll inspect our dataset, store it in a variable, and check out the structure by reviewing some helpful commands:
class() to quickly determine the object type. You see this information in the str() command too.
head() to quickly view just the first n rows of your data.
tail() to quickly view just the last n rows of your data.
unique() to quickly view the unique values in a vector or similar data structure.
glimpse() and View() (in RStudio) to take a peek at your data structures.

head() to view the first portion of your data
You can take a look at the first few rows (6 by default) of your data.frame using the head() function. In fact you can play with the parameters to pull a specific number of rows or lines from the start of your data.frame or list.
# Re-import our microbes.csv file from the data folder if you need to
# microbes <- read_csv(file = "data/microbes.csv", col_names = TRUE, col_types = cols())
# Use default head() parameters
head(microbes)
# Pull just the first 3 rows
head(microbes, 3)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
tail() to view the latter portion of your data
Likewise, to inspect the last rows, you can use the tail() function. Again, you can decide on how many rows you'd like to see from the end of your object.
# Let's pull up the last 10 rows to look at!
tail(microbes, 10)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0 | toluene | saline | treatment | 3 | Bacteria | Bacteroidetes | Sphingobacteriia | Sphingobacteriales | Saprospiraceae | Candidatus | GAATTGGCGGGGGTCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTAGAATGCGAGTGCCTGTGTGTGAAAGCATACATTCCTTCGGGACACAAAGCAAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATCTTCAGTTGCCAGCATTTAAGGTGGGGACTCTGAAGAGACTGCCGGCGTAAGCCGCGAGGAAGGTGGGGATGATGTCAAGTCATCATGGCCTTTATGCCCAGGGCTACACACGTGCTACAATGGCCGGTACAACGGGTCGCGAAGCTGTGAAGCGGAGCCAATCCTATAAAGCCGGTCTCAGTTCGGATTGGAGTCTGGAACTCGACTCCATGAAGGTGGAATCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGACCTTGCACA |
| 0 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Bacilli | Bacillales | Family.X | Thermicanus | GCAGCAGTAGGGAATCTTCGGCAATGGGCGAAAGCCTGACCGAGCAACGCCGCGTGAGTGAGGAAGGCCTTCGGGTTGTAAAACTCTGTTGTTTGGGAAGAAGGGAAAGGGTAGGCCCCTTAGGTGACGGTACCAAACGAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCGAGCGTTGTCCGGAATGATTGGGCGTAAAGCGCGCGCAGGCGGTCCTTTAAGTCTGATGTGAAAGCCCGCGGCTTAACCGCGGAAGGTCATTGGAAACTGGGGGACTTGAGGCTAGGAGAGGGAAGTGGAATTCCTGGTGTAGCGGTGAAATGCGTAGAGATCAGGAGGAATACCGATGGCGAAGGCAGCTTCCTGGCCTAGGGCTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Family.XVII | Sulfobacillus | GCAGCAGTAGGGAATTTTGGACAATGGGGGAAACCCTGATCCAGCGACGCCGCGTGCGCGACGAAGGCCTTCGGGTTGTAAAGCGCTGTCATCCGGGACGAAGGTCCTCTCTTCGAAGAGGGGAGAGGAATGACGGTACCGGAGGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCGAGCGTTGTCCGGAATGACTGGGCGTAAAGGGCGTCTAGGCGGCCTGGTAAGTCCGATGTGAAAGGCCACGGCTTAACCGTGGAGGGTCATTGGAAACTGTCAGGCTTGAGGGCAGTAGAGGGGTGCGGAATTCCCGGTGTAGCGGTGATATGCGTAGAGATCGGGAAGAACACCAGTGGCGAAGGCGGCACCCTGGGCTGGCCCTGACGCTAAAGCGCGAAAGCGTGGGGAGCGAACGGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidia.Incertae.Sedis | Prolixibacteraceae | Prolixibacter | GCAGCAGTGAGGAATATTGGTCAATGGGCGCAAGCCTGAACCAGCCATCCCGCGTGAAGGAAGACTGCCCTATGGGTTGTAAACTTCTTTTCTGTACCAAGAATTGCCCCTACGCGTAGGGGATTGACGGTAGTACAGGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTTTAAGTCAGTGGTGAAATCCTGCGGCTCAACCGTAGAACTGCCATTGATACTGAAGAGCTTGAATACAATTGAGGTAGGCGGAATGAGTAGTGTAGCGGTGAAATGCTTAGATATTACTCAGAACACCGATTGCGAAGGCAGCTTACCAAACTGATATTGACGCTGAGGCACGAAAGCGTGGGGAGCGAACAGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Thermotogae | Thermotogae | Thermotogales | Thermotogaceae | Mesotoga | GCAGCAGTGCGGAATTTTAGATAATGGAGGCAACTCTGATCTAGCGACGCCGCGTGCAGGAAGAAGGTCTTCGGATTGTAAACTGCTGTGGTAAGGGAAAAATGCCATGTAGAGTGGAAAGCTACATGGAGGGATGGTACTTTACTAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCGAGCGTTACCCGGAATCACTGGGCGTAAAGGGAGCGTAGGTGGCCTGACATGTCGACTGTGAAAACCCGGAGCTCAACTCCGGACTTGCAGTTGAAACTGCCAGGCTTGAGGACGGTAGAGGAAGACGGAACTGCCAGTGTAGGGGTAAAATCCTTAGATATTGGCAGGAACGCCGGTGACGAAGGTGGTTTTCTGGGCCGGTTCTGACACTGATGCTCGAAAGCCAGGGGAGCGAACGGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas | GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTCAGGAGGAAAGGTTAGTAGTTAATACCTGCTAGCTGTGACGTTACTGACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTACGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATTTCGAACTGGCAAACTAGAGTGTGATAGAGGGTGGTAGAATTTCAGGTGTAGCGGTGAAATGCGTAGAGATCTGAAGGAATACCGATGGCGAAGGCAGCCACCTGGGTCAACACTGACGCTCATGTACGAAAGCGTGGGGAGCAAACAGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiaceae.4 | Caminicella | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCAACGCCGCGTGAGCGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTCCTAAGGGAAGAATAATGACGGTACCTTAGGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGAATCACTGGGCGTAAAGGGTGCGTAGGCGGCTAATCAAGCCAGAGGTGAAAGGCTACGGCTTAACCGTAGTAAGCCTTTGGAACTGAATAGCTTGAGTGCAGGAGAGGAGAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGCAAACAGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Vallitalea | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAAGGATGAAGGTTTTCGGATCGTAAACTTCTATCAGCAGGGAAGATAGTGACAGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGTGCGTAGGCGGCGAAGTAAGTCAGATGTGAAAGCCCGAAGCTCAACTTCGGGACTGCATTTGAAACTGCTTTGCTAGAGTGCAGGAGAGGAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGCGAACAGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Desulfitibacter | GCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGACGCAGCAACGCCGCGTGAGCGACGAAGGCCTTCGGGTCGTAAAGCTCTGTCATTGGGGAAGAAGTCTTGTGTGCGAATAGTGCATAAGGTGACGGTACCCGAGGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAACACGTAGGGGGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCGTGTAGGCGGCTTGGCAAGTCTAGTGTGAAATGCCTGGGCTCAACCCAGGATTTGCACTGGAAACTGCTAGGCTTGAGGGCAGGAGAGGCAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGCCTGACCCTGACGCTGAGGCGCGAAAGCGTGGGGAGCGAACAGGATTAGATAC |
| 0 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Moorella | GCAGCAGTGGGGAATCTTGCGCAATGGGGGAAACCCTGACGCAGCAACGCCGCGTGAGCGATGAAGGCCTTCGGGTTGTAAAGCTCTGTCATCAGGGACGAAGTCTCGTGCAAACGAGGTGACGGTACTTGAGGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAAAACGTAGGGGGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCGTGTAGGCGGCCCGACAAGTCAGATGTGAAAAACCCAGGCTCAACCTGGGGGTTGCATTTGAAACTGGCGGGCTTGAGGGCAGGAGAGGAGAGTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGACTGACCCTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATAC |
unique() to retrieve a list of the unique elements within an object
You may be interested in knowing more about the data set you're working with such as "How many different genera turned up in our entire experiment?" Recall that we have a column labeled genus within our data set microbes.
You could extract the whole column and scan through it or look at just a portion of it.
# Recall: Use the $ sign to access named columns within your data.frame!
microbes$genus
As you may have noticed, this method printed the entire genus column. While useful information for certain aspects, it doesn't answer our main question of how many different genera turned up in our sampling.
The function unique() can help us answer this question by removing duplicated entries, thus living up to its name. It can take in a number of different objects but usually returns an object of the same type that it was given as input.
Let's take a look at using it on our question.
# Retrieve a list of unique genera from our data set
unique(microbes$genus)
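To illustrate that unique() returns the same kind of object it was given, here is a minimal base R sketch on made-up data (the vector and data frame below are not taken from microbes).

```r
# unique() on a character vector returns a character vector
genera <- c("Pseudomonas", "Bacteroides", "Pseudomonas", "Shewanella")
unique(genera)

# unique() on a data frame returns a data frame with duplicated rows removed
samples <- data.frame(salinity = c("brackish", "brackish", "saline"),
                      group    = c("control", "control", "treatment"))
unique(samples)
```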
length() or str() to retrieve the size of some objects
Note from above that we have only one entry per genus, but how many genera are there in total? Here we introduce length() which does just as it implies by returning the length of a vector, list, or factor. You can also use it to set the length of those objects but it's not something we have reason to do.
On the other hand str() always gives us the same kind of information plus a little more. Later on, we'll see that more isn't always better and that using length() has its advantages.
# Two ways to see how many unique entries we have
# ?length
length(unique(microbes$genus))
# or
str(unique(microbes$genus))
chr [1:251] "Pseudomonas" "Candidatus" "Lachnoclostridium" "Trichococcus" ...
Using unique() we are returned a character vector containing 251 genera.
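One thing to keep in mind is that length() means different things for different objects. The sketch below, using made-up values, shows that for a data.frame it counts columns rather than rows.

```r
genera <- c("Pseudomonas", "Bacteroides", "Pseudomonas")

# Number of elements in a vector
length(genera)

# Number of unique elements
length(unique(genera))

# For a data frame, length() counts columns, not rows; use nrow() for rows
toy <- data.frame(genus = genera, abundance = c(40.69, 2.88, 1.22))
length(toy)
nrow(toy)
```

This is why we extract the genus column first and only then count its unique values.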
glimpse() and View() show us our data
Suppose we want to see more of our data frame. There are a couple of choices that can be used outside of Jupyter Notebook. In RStudio you have access to your Environment pane which can give you a quick idea of values for variables in your environment, including a bit of what your data.frame looks like.
Clicking on a data object like microbes will generate a new tab that shows your entire data.frame in a human-readable format similar to an Excel spreadsheet. The same result can be accomplished by using the view command View(microbes).
The glimpse() command brings up a comprehensive summary of your object that looks very similar to the information provided in the Environment pane. You'll find it looks very much like the str() command but is formatted in a more human-readable way. It tries to provide as much information as possible in a small amount of space.
We can use this command in Jupyter so let's take a glimpse at glimpse().
# View(microbes)
# Only works in RStudio
# Let's compare str() to glimpse()
str(microbes)
spec_tbl_df[,12] [6,656 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ abundance: num [1:6656] 40.69 11.71 11.13 6.14 3.97 ... $ compound : chr [1:6656] "compound-free" "compound-free" "compound-free" "compound-free" ... $ salinity : chr [1:6656] "brackish" "brackish" "brackish" "brackish" ... $ group : chr [1:6656] "control" "control" "control" "control" ... $ replicate: num [1:6656] 1 1 1 1 1 1 1 1 1 1 ... $ kingdom : chr [1:6656] "Bacteria" "Bacteria" "Bacteria" "Bacteria" ... $ phylum : chr [1:6656] "Proteobacteria" "Proteobacteria" "Firmicutes" "Firmicutes" ... $ class : chr [1:6656] "Gammaproteobacteria" "Alphaproteobacteria" "Clostridia" "Bacilli" ... $ order : chr [1:6656] "Pseudomonadales" "Rhodospirillales" "Clostridiales" "Lactobacillales" ... $ family : chr [1:6656] "Pseudomonadaceae" "Rhodospirillaceae" "Lachnospiraceae" "Carnobacteriaceae" ... $ genus : chr [1:6656] "Pseudomonas" "Candidatus" "Lachnoclostridium" "Trichococcus" ... $ ASV : chr [1:6656] "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGC"| __truncated__ "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCT"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCC"| __truncated__ "GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCC"| __truncated__ ... - attr(*, "spec")= .. cols( .. abundance = col_double(), .. compound = col_character(), .. salinity = col_character(), .. group = col_character(), .. replicate = col_double(), .. kingdom = col_character(), .. phylum = col_character(), .. class = col_character(), .. order = col_character(), .. family = col_character(), .. genus = col_character(), .. ASV = col_character() .. )
# glimpse gives us less information overall but is also less redundant
glimpse(microbes)
Rows: 6,656 Columns: 12 $ abundance <dbl> 40.69, 11.71, 11.13, 6.14, 3.97, 3.90, 2.88, 2.54, 1.22, 1.0~ $ compound <chr> "compound-free", "compound-free", "compound-free", "compound~ $ salinity <chr> "brackish", "brackish", "brackish", "brackish", "brackish", ~ $ group <chr> "control", "control", "control", "control", "control", "cont~ $ replicate <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ $ kingdom <chr> "Bacteria", "Bacteria", "Bacteria", "Bacteria", "Bacteria", ~ $ phylum <chr> "Proteobacteria", "Proteobacteria", "Firmicutes", "Firmicute~ $ class <chr> "Gammaproteobacteria", "Alphaproteobacteria", "Clostridia", ~ $ order <chr> "Pseudomonadales", "Rhodospirillales", "Clostridiales", "Lac~ $ family <chr> "Pseudomonadaceae", "Rhodospirillaceae", "Lachnospiraceae", ~ $ genus <chr> "Pseudomonas", "Candidatus", "Lachnoclostridium", "Trichococ~ $ ASV <chr> "GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAA~
The information provided by glimpse() is sparser than that of str(): the formatting is a little tighter and we don't have to see the extra column specification that str() prints, which can save a lot of vertical space. On the other hand, the command takes slightly longer to type, so which one you use is a personal choice.
NA and NaN values
What happens when you import data with missing values? These could be empty entries in a CSV file or blank cells in an xlsx file. Perhaps, as we'll see later, it could be a specifically annotated entry like "No_Data". These are usually the result of missing data points from an experiment but, depending on the source of your data, they can also arise for other reasons, such as values that fall below a threshold.
Missing values in R are represented by NA (Not Available). Undefined numeric results (like the result of dividing zero by zero) are represented by NaN (Not a Number); note that dividing a non-zero number by zero gives Inf or -Inf rather than NaN. Both NA and NaN mark values that are absent or unusable. Both of them, especially NAs, need special handling, otherwise they can lead to errors or unexpected results in functions that we use frequently.
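To make the distinction concrete, here is a minimal sketch in base R. The variable names (missing_value, undefined_value, infinite_value) are ours and exist only for illustration.

```r
# NA marks a missing value; NaN marks an undefined numeric result
missing_value   <- NA
undefined_value <- 0 / 0   # NaN
infinite_value  <- 1 / 0   # Inf, not NaN

is.na(missing_value)       # TRUE
is.na(undefined_value)     # TRUE: is.na() also flags NaN
is.nan(missing_value)      # FALSE: NA is not NaN
is.nan(undefined_value)    # TRUE
is.finite(infinite_value)  # FALSE

# NA propagates through arithmetic and comparisons
NA + 1                     # NA
NA > 3                     # NA
NA == NA                   # NA, so test with is.na() rather than ==
```

The last line is why we never test for missing values with ==: the comparison itself returns NA.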
Let us begin by building an example containing NA values.
# Set up some vectors for a data.frame
brand <- c("wildrose", "guinness", "grasshoper")
wheat.type <- c("Hard Red Spring", "Hard Red Winter", "Soft Red Winter" )
rating <- c(5, 7, NA)
# Put it all together with data.frame()
NA.example <- data.frame(brand, wheat.type, rating)
# Look at our data frame
NA.example
| brand | wheat.type | rating |
|---|---|---|
| <chr> | <chr> | <dbl> |
| wildrose | Hard Red Spring | 5 |
| guinness | Hard Red Winter | 7 |
| grasshoper | Soft Red Winter | NA |
NA values
Some mathematical functions can ignore NA values if you set the logical parameter na.rm = TRUE. Under the hood, if the function recognizes this parameter, it will remove the NA values before proceeding to perform its mathematical operation.
# Use the mean() function and see what happens with NA values
mean(rating) # some functions need to be explicitly told what to do with NAs. No errors though!
mean(rating, na.rm = TRUE) #Avoid using just "T" as an abbreviation for "TRUE"
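The same na.rm parameter is accepted by several other base R summary functions. The sketch below uses a stand-alone vector (rating_example, a name chosen here for illustration) so that it runs on its own.

```r
# A small vector with one missing value
rating_example <- c(5, 7, NA)

sum(rating_example)                  # NA
sum(rating_example, na.rm = TRUE)    # 12
max(rating_example, na.rm = TRUE)    # 7
sd(rating_example, na.rm = TRUE)     # standard deviation of 5 and 7

# Not every function has an na.rm parameter: length() counts the NA too
length(rating_example)               # 3
sum(!is.na(rating_example))          # 2 non-missing values

# If every value is NA, nothing is left to average and the result is NaN
mean(c(NA_real_, NA_real_), na.rm = TRUE)
```

The final call ties back to NaN: removing all of the values leaves an empty vector, and the mean of nothing is undefined.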
apply() on data with NAs?
Now, I am going to take the counts data from lecture 01 and add a few NAs. If I now try to calculate the mean number of counts, I will get NA as an answer for the rows that had NAs.
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
Site2 = c(geneA = 15, geneB = NA, geneC = 27, geneD = 28),
Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = NA))
counts
# Here we pass only the function name "mean", without any extra parameters
apply(counts, MARGIN = 1, mean)
| | Site1 | Site2 | Site3 |
|---|---|---|---|
| | <dbl> | <dbl> | <dbl> |
| geneA | 2 | 15 | 10 |
| geneB | 4 | NA | 7 |
| geneC | 12 | 27 | 13 |
| geneD | 8 | 28 | NA |
# Pass parameters in our call
apply(counts, MARGIN = 1, mean, na.rm = TRUE)
# Equivalent code - perhaps clearer but more verbose
apply(counts, MARGIN = 1, FUN = function(x) mean(x, na.rm=TRUE))
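For the specific case of row or column means, base R also provides rowMeans() and colMeans(), which accept na.rm directly. The sketch below rebuilds the same small table under a different name (counts_example, chosen here for illustration) so that it is self-contained.

```r
# Rebuild the small counts table with two missing values
counts_example <- data.frame(
  Site1 = c(2, 4, 12, 8),
  Site2 = c(15, NA, 27, 28),
  Site3 = c(10, 7, 13, NA),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Mean per gene (row), ignoring NAs
row_means <- rowMeans(counts_example, na.rm = TRUE)
row_means

# Mean per site (column), ignoring NAs
col_means <- colMeans(counts_example, na.rm = TRUE)
col_means

# Same values as the apply() approach
apply_means <- apply(counts_example, MARGIN = 1, mean, na.rm = TRUE)
all.equal(row_means, apply_means)
```

apply() remains the more general tool, since it works with any function; rowMeans() and colMeans() are simply shorter for this common task.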
is.na() function to check your data
How do we find out ahead of time that we are missing data? Knowing is half the battle and is.na() can help us determine this with some data structures. The is.na() function can search through data structures and return a logical structure of the same dimensions.
With a vector we can easily see how some basic functions work.
# Let's check out this vector that contains NA values
na_vector <- c(5, 6, NA, 7, 7, NA)
# This works on vectors...
is.na(na_vector)
# and data.frames too!
is.na(counts)
| | Site1 | Site2 | Site3 |
|---|---|---|---|
| geneA | FALSE | FALSE | FALSE |
| geneB | FALSE | TRUE | FALSE |
| geneC | FALSE | FALSE | FALSE |
| geneD | FALSE | FALSE | TRUE |
# Let's look at our microbe data for na values
is.na(microbes)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
any() function evaluates logical vectors
In the case of large data frames, as you can see, there are just too many entries to inspect by eye. Sometimes we are just interested in knowing whether at least one of our logical values is TRUE. That is accomplished using the any() function, which takes one or more logical vectors (or a logical matrix, like the one is.na() returns for a data.frame) and returns a single TRUE if at least one value is TRUE, and FALSE otherwise.
We can use it to quickly ask if our microbes data frame has any NA values.
# Before we dig too deep, can we check if there are ANY NA values in our data.frame?
any(is.na(microbes)) # logical (TRUE or FALSE).
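Beyond a yes/no answer, it is often useful to count how many values are missing and in which columns. The sketch below uses a small made-up data frame (na_example is a hypothetical name, not part of the microbes data) and relies on the fact that TRUE counts as 1 when summed.

```r
# A small made-up data frame with three missing values
na_example <- data.frame(
  abundance = c(0.5, NA, 1.2, NA),
  genus     = c("Smithella", NA, "Pseudomonas", "Vibrio")
)

any(is.na(na_example))        # TRUE: at least one NA somewhere
total_na <- sum(is.na(na_example))
total_na                      # 3 missing values in total

# Number of missing values in each column
na_per_column <- colSums(is.na(na_example))
na_per_column

# anyNA() is a shortcut for any(is.na(x)) on a vector
anyNA(na_example$abundance)   # TRUE
```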
Now we've confirmed that there are some NA values in our data. Given that there are 6656 rows, we need to find a way to identify those rows with NA values and conversely those without NA values. Let's start with simple structures.
which() function
Using is.na() we were returned a logical vector indicating whether or not each value was NA. There are several ways we can apply this information, but a useful method for a vector of logicals is to ask which() positional indices are TRUE.
In our case, we use which() after checking for NA values in our object.
# Take a look at na_vector before you start manipulating it
na_vector
# wrap which() around our is.na() call
which(is.na(na_vector))
# indices where NAs are present in na_vector
na_values <- which(is.na(na_vector))
# cut out the na_values indices
removed_na_vector_1 <- na_vector[-na_values] ; removed_na_vector_1
#equivalent to
removed_na_vector_2 <- na_vector[!is.na(na_vector)] ; removed_na_vector_2
# Which values in microbes are NA? Recall we have 6656 rows of data!
which(is.na(microbes))
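Two details about which() are worth a short sketch. First, negative indexing with the result of which() silently drops everything when there are no NAs at all. Second, for two-dimensional data the arr.ind = TRUE argument reports row and column positions instead of a single running index. The object names below (no_na_vector, counts_example) are ours, for illustration only.

```r
# Pitfall: negative indexing with an empty which() result
no_na_vector <- c(5, 6, 7)
no_na_idx <- which(is.na(no_na_vector))   # integer(0), nothing is NA

dropped_all <- no_na_vector[-no_na_idx]   # numeric(0): every value is lost!
kept_all <- no_na_vector[!is.na(no_na_vector)]   # 5 6 7, the safe version

# Row and column positions of NAs in two-dimensional data
counts_example <- data.frame(
  Site1 = c(2, 4, 12, 8),
  Site2 = c(15, NA, 27, 28),
  Site3 = c(10, 7, 13, NA),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)
na_positions <- which(is.na(counts_example), arr.ind = TRUE)
na_positions   # one line per NA, with "row" and "col" columns
```

For this reason the logical form, x[!is.na(x)], is usually the safer way to remove NAs from a vector.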
complete.cases() to query larger objects
We have verified in many ways that we have at least one NA value in microbes. Often we may wish to drop incomplete observations where one or more variables is lacking data. Using the which() function would be helpful but, as we can see from our above example, it returns only the position of each NA element counted across the whole data.frame, not its row. Instead, we want to look for rows that have any NA values. If you were only concerned with NA values in a specific column of your data.frame, which() would be a good way to accomplish your task.
In the case of removing any incomplete rows, the function complete.cases() checks each row for NA values and returns a logical vector with one entry per row of the data.frame: TRUE for rows with no missing values and FALSE for rows with at least one NA. You can then keep only the complete rows (or only the incomplete ones) using conditional indexing.
# ?complete.cases
# Outputs a logical vector specifying which observations/rows have no missing values across the entire sequence.
head(complete.cases(microbes), 20)
# Use it wisely to keep complete rows. Pop quiz [x,y] will it be x or y?
head(microbes[complete.cases(microbes), ], 5)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
# OR use the "!" to flip our boolean result and retrieve all of the incomplete cases!
microbes_NArows <- microbes[!complete.cases(microbes),]
microbes_NArows
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.07 | compound-free | brackish | control | 1 | Bacteria | Planctomycetes | Planctomycetacia | Planctomycetales | Planctomycetaceae | NA | GAATTGGCGGGGGCTCACACAAGCGGTGGAGGATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCTAGACTTGACATGCTTAAGAATCTTCTGGAAACAGAGGAGTGCCCTTCGGGGAGCTTTTGCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCTTAACGAGCGAAACCCTTATCTTTAGTTGCCAGCGGGTAATGCCGGGGACTCTAAAGAGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTTTAGGGCTGCACACGTCCTACAATGGTACGTACAAAGGGAAGCAAAGTCGCGAGGCCAAGCTAATCCCAAAAAGCGTGCCTCAGTTCGGATTGTAGGCTGCAATTCGCCTACATGAAGCTGGAATCGCTAGTAATCGCGGGTCAGCATACCGCGGTGAATGTGTTCCTGAGCCTTGTACA |
| 0.00 | compound-free | brackish | NA | 1 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Comamonadaceae | Acidovorax | GAATTGACGGGGACCCGCACAAGCGGTGGATGATGTGGTTTAATTCGATGCAACGCGAAAAACCTTACCCACCTTTGACATGTACGGAATCCTTTAGAGATAGAGGAGTGCTCGAAAGAGAGCCGTAACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGCCATTAGTTGCTACGAAAGGGCACTCTAATGGGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATAGGTGGGGCTACACACGTCATACAATGGCTGGTACAAAGGGTTGCCAACCCGCGAGGGGGAGCTAATCCCATAAAGCCAGTCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGTCACGGTGAATACGTTCCCGGGTCTTGTACA |
| 0.00 | compound-free | brackish | control | 1 | Bacteria | Planctomycetes | Phycisphaerae | Phycisphaerales | Phycisphaeraceae | NA | GAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTTGCTTAATTCGAGGCAACGCGAAGAACCTTACCTGGGTTTGACATGCATGGATCAACTCGGTGAAAGCCGAGCCACAGTCGCAAGACCGGAACATGCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCTGTGAAGTGTCGGGTTAAGTCCTCTAACGAGCGCAACCCCTGTCGTTAGTTGCTCACGGGTTATGCCGAGTACTCTAACGAGACTGCCGCTGTAAAGCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACATCCAGGGTTGCAAACGTGCTACAATGGCGCGTACAAAGCGAAGCGAGGCCGCGAGGCGGAGCAAATCGCAAAAAGCGCGCCCCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATCGCTAGTAATCGGAGATCAGCTACGCTCCGGTGAATGTGTTCCTGAGCCTTGTACA |
| NA | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminococcus | GAATTGGCGGGGGCCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCAAGAATCCTGTAGAGATACGGGAGTGCCTTCGGGAGCTTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTACGGTTAGTTGCTACGCAAGAGCACTCTAGCCGGACTGCCGTTGACAAAACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCGATTAACAAAGGGGAGCAATACAGCAATGTGGAGCAAATCCCAAAAAATCGTCTCAGTTCAGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCAGATCAGCATGCTGCGGTGAATACGTTCCCGGGCCTTGTACA |
| NA | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Paludibacter | GAATTGGCGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCAGACGACGGACTTGGAAACAGGTCTTCCAGCAATGGCGTCTGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCCTATCATTAGTTACTAACAGGTCGAGCTGAGGACTCTAGTGAGACTGCCGCCGTAAGGTGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGGGGTACAGAGGGTTGCTACCTGGTGACAGGATGCTAATCTCATAAATCCTCTCTCAGTTCGGATCGAAGTCTGCAACTCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| NA | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Vibrionales | Vibrionaceae | Catenococcus | GAATTGGCGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTGTTTGCCAGCGAGTAATGTCGGGAACTCCAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGACAGTACAGAGGGAAGCAAAGCGGCGACGTGGAGCGGAACCCAAAAAGCTGTTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACA |
| NA | compound-free | brackish | control | 2 | Bacteria | Verrucomicrobia | Verrucomicrobiae | Verrucomicrobiales | Verrucomicrobiaceae | Luteolibacter | GAATTGACGGGGACCCGCACAAGCGGTGGAGTATGTGGCTTAATTCGATGCAACGCGAAGAACCTTACCAAGGCTTGACATGCATCTCTAAGCGCGTGAAAGCGCGTGACCCTTCGGGGGATTTGCACAGGTGCTGCATGGCCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTGATTAGTTGCCAGCGGGTAATGCCGGGAACTCTAATCAGACTGCCCAGATCAACTGGGAGGAAGGTGGGGACGACGTCAGGTCAGTATGGCCCTTACGCCTTGGGCTGCACACGTACTACAATGCCCAGCACAATGAGAACCGAGACCGCGAGGTGGAGGAAATCTGCAAAACTGGGCCCAGTTCGGATTGGAGGCTGCAACCCGCCTCCATGAAGTTGGAATCGCTAGTAATGGCACATCAGCAACGGTGCCGTGAATACGTTCCCGGGTCTTGTACA |
| NA | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Alteromonadaceae | Marinobacter | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCCTTTCCCTAGTTGCTAGCAGGTAATGCTGAGAACTCTAGGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAGGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGTGCGTACAGAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCTTAAAACGCATCGTAGTCCGGATCGGAGTCTGCAACTCGACTCCGTGAAGTCGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Actinobacteria | Actinobacteria | Frankiales | Sporichthyaceae | NA | GAATTGGCGGGGCCCCGCACAAGCAGCGGAGCATGCGGCTTAATTCGACGCAACGCGAAGAACCTTACCAAGGCTTGACATATACAGGAATATGGCAGAGATGTCATAGCCGCAAGGTCTGTATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTTCTGTGTTGCCAGCATTTAGTTGGGGACTCACAGGAGACTGCCGGGGTTAACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTGCACGCATGCTACAATGGCTGGTACAAACGGCTGCGATACCGCAAGGTGGAGCGAATCCGAAAAAGCCAGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTTGCTAGTAATCGTAGATCAGCAACGCTACGGTGAATACGTTCCCGGGGCTTGCACA |
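As a small self-contained sketch of the same idea, the code below applies complete.cases() to a made-up data frame (example_df, with hypothetical values), compares it with base R's na.omit(), and shows how to check only a subset of columns.

```r
# A made-up data frame: row 2 lacks abundance, row 3 lacks group and genus
example_df <- data.frame(
  abundance = c(40.69, NA, 0.07, 3.97),
  group     = c("control", "control", NA, "control"),
  genus     = c("Pseudomonas", "Ruminococcus", NA, "Proteiniphilum")
)

complete.cases(example_df)    # TRUE FALSE FALSE TRUE

# Keep only rows with no missing values in any column
complete_rows <- example_df[complete.cases(example_df), ]

# na.omit() drops the same rows in a single call
omitted_rows <- na.omit(example_df)

# Only require that abundance is present
has_abundance <- example_df[complete.cases(example_df["abundance"]), ]
has_abundance
```

Checking a subset of columns is useful when a missing value in one variable (here genus) should not cost you an otherwise usable observation.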
NAs with something useful
Depending on your data or situation, you may want to keep rows (observations) even though some of their values are missing. Instead of dropping them, consider replacing the NAs in your data set. The replacement could be a sample average, the mode of the data, or a value that is below a threshold.
# Find the NA values and just replace them with an equivalent value that makes sense to your analysis
microbes_NArows[is.na(microbes_NArows$abundance),]$abundance <- 0
microbes_NArows
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.07 | compound-free | brackish | control | 1 | Bacteria | Planctomycetes | Planctomycetacia | Planctomycetales | Planctomycetaceae | NA | GAATTGGCGGGGGCTCACACAAGCGGTGGAGGATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCTAGACTTGACATGCTTAAGAATCTTCTGGAAACAGAGGAGTGCCCTTCGGGGAGCTTTTGCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCTTAACGAGCGAAACCCTTATCTTTAGTTGCCAGCGGGTAATGCCGGGGACTCTAAAGAGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTTTAGGGCTGCACACGTCCTACAATGGTACGTACAAAGGGAAGCAAAGTCGCGAGGCCAAGCTAATCCCAAAAAGCGTGCCTCAGTTCGGATTGTAGGCTGCAATTCGCCTACATGAAGCTGGAATCGCTAGTAATCGCGGGTCAGCATACCGCGGTGAATGTGTTCCTGAGCCTTGTACA |
| 0.00 | compound-free | brackish | NA | 1 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Comamonadaceae | Acidovorax | GAATTGACGGGGACCCGCACAAGCGGTGGATGATGTGGTTTAATTCGATGCAACGCGAAAAACCTTACCCACCTTTGACATGTACGGAATCCTTTAGAGATAGAGGAGTGCTCGAAAGAGAGCCGTAACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGCCATTAGTTGCTACGAAAGGGCACTCTAATGGGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATAGGTGGGGCTACACACGTCATACAATGGCTGGTACAAAGGGTTGCCAACCCGCGAGGGGGAGCTAATCCCATAAAGCCAGTCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGTCACGGTGAATACGTTCCCGGGTCTTGTACA |
| 0.00 | compound-free | brackish | control | 1 | Bacteria | Planctomycetes | Phycisphaerae | Phycisphaerales | Phycisphaeraceae | NA | GAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTTGCTTAATTCGAGGCAACGCGAAGAACCTTACCTGGGTTTGACATGCATGGATCAACTCGGTGAAAGCCGAGCCACAGTCGCAAGACCGGAACATGCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCTGTGAAGTGTCGGGTTAAGTCCTCTAACGAGCGCAACCCCTGTCGTTAGTTGCTCACGGGTTATGCCGAGTACTCTAACGAGACTGCCGCTGTAAAGCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACATCCAGGGTTGCAAACGTGCTACAATGGCGCGTACAAAGCGAAGCGAGGCCGCGAGGCGGAGCAAATCGCAAAAAGCGCGCCCCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATCGCTAGTAATCGGAGATCAGCTACGCTCCGGTGAATGTGTTCCTGAGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminococcus | GAATTGGCGGGGGCCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCAAGAATCCTGTAGAGATACGGGAGTGCCTTCGGGAGCTTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTACGGTTAGTTGCTACGCAAGAGCACTCTAGCCGGACTGCCGTTGACAAAACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCGATTAACAAAGGGGAGCAATACAGCAATGTGGAGCAAATCCCAAAAAATCGTCTCAGTTCAGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCAGATCAGCATGCTGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Paludibacter | GAATTGGCGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCAGACGACGGACTTGGAAACAGGTCTTCCAGCAATGGCGTCTGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCCTATCATTAGTTACTAACAGGTCGAGCTGAGGACTCTAGTGAGACTGCCGCCGTAAGGTGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGGGGTACAGAGGGTTGCTACCTGGTGACAGGATGCTAATCTCATAAATCCTCTCTCAGTTCGGATCGAAGTCTGCAACTCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Vibrionales | Vibrionaceae | Catenococcus | GAATTGGCGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTGTTTGCCAGCGAGTAATGTCGGGAACTCCAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGACAGTACAGAGGGAAGCAAAGCGGCGACGTGGAGCGGAACCCAAAAAGCTGTTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Verrucomicrobia | Verrucomicrobiae | Verrucomicrobiales | Verrucomicrobiaceae | Luteolibacter | GAATTGACGGGGACCCGCACAAGCGGTGGAGTATGTGGCTTAATTCGATGCAACGCGAAGAACCTTACCAAGGCTTGACATGCATCTCTAAGCGCGTGAAAGCGCGTGACCCTTCGGGGGATTTGCACAGGTGCTGCATGGCCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTGATTAGTTGCCAGCGGGTAATGCCGGGAACTCTAATCAGACTGCCCAGATCAACTGGGAGGAAGGTGGGGACGACGTCAGGTCAGTATGGCCCTTACGCCTTGGGCTGCACACGTACTACAATGCCCAGCACAATGAGAACCGAGACCGCGAGGTGGAGGAAATCTGCAAAACTGGGCCCAGTTCGGATTGGAGGCTGCAACCCGCCTCCATGAAGTTGGAATCGCTAGTAATGGCACATCAGCAACGGTGCCGTGAATACGTTCCCGGGTCTTGTACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Alteromonadaceae | Marinobacter | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCCTTTCCCTAGTTGCTAGCAGGTAATGCTGAGAACTCTAGGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAGGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGTGCGTACAGAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCTTAAAACGCATCGTAGTCCGGATCGGAGTCTGCAACTCGACTCCGTGAAGTCGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Actinobacteria | Actinobacteria | Frankiales | Sporichthyaceae | NA | GAATTGGCGGGGCCCCGCACAAGCAGCGGAGCATGCGGCTTAATTCGACGCAACGCGAAGAACCTTACCAAGGCTTGACATATACAGGAATATGGCAGAGATGTCATAGCCGCAAGGTCTGTATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTTCTGTGTTGCCAGCATTTAGTTGGGGACTCACAGGAGACTGCCGGGGTTAACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTGCACGCATGCTACAATGGCTGGTACAAACGGCTGCGATACCGCAAGGTGGAGCGAATCCGAAAAAGCCAGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTTGCTAGTAATCGTAGATCAGCAACGCTACGGTGAATACGTTCCCGGGGCTTGCACA |
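The same replacement can be written by indexing the column directly, which is a common base R idiom. The sketch below uses a made-up data frame (replace_example, hypothetical values) and shows both a zero fill and a mean fill; which replacement is appropriate depends entirely on your analysis.

```r
# A made-up data frame with two missing abundance values
replace_example <- data.frame(
  genus     = c("Pseudomonas", "Ruminococcus", "Paludibacter", "Smithella"),
  abundance = c(40.69, NA, NA, 0.27)
)

# Replace NAs with zero: index the column, then assign
zero_filled <- replace_example
zero_filled$abundance[is.na(zero_filled$abundance)] <- 0

# Replace NAs with the mean of the observed values
mean_filled <- replace_example
observed_mean <- mean(replace_example$abundance, na.rm = TRUE)
mean_filled$abundance[is.na(mean_filled$abundance)] <- observed_mean

zero_filled
mean_filled
```

Working on a copy, as done here, keeps the original values available in case the choice of replacement needs to be revisited.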
dplyr (DEE ply er) package
Now that we've inspected our data for various pitfalls, we can move on to filtering and sorting. To be able to answer any questions with our data, we need the ability to select and filter parts of our data. This can be accomplished with base functions in R, but the dplyr package provides a more human-readable syntax.
Image courtesy of xkcd
The dplyr package was made by Hadley Wickham to help make data frame manipulation easier. It has 5 major functions:
filter() - subsets your data.frame by row
select() - subsets your data.frame by columns
arrange() - orders your data.frame alphabetically or numerically by ascending or descending variables
mutate(), transmute() - create a new column of data
summarize() or summarise() - reduces data to summary values (for example using mean(), sd(), min(), quantile(), etc)
It is often extremely useful to subset your data by some logical condition. We've seen some examples above where we used functions and code to identify and keep specific rows. Let's dig deeper into that topic.
Conditionals ask a question about one or more values and return a logical (TRUE or FALSE) result. Here's a quick table breaking down the uses of basic conditional statements.
| Logical operator | Meaning | Example | Result |
|---|---|---|---|
| == | equal to | "this" == "that" | FALSE |
| != | not equal to | 4 != 5 | TRUE |
| > | greater than | 4 > 5 | FALSE |
| >= | greater than or equal to | 4 >= 5 | FALSE |
| < | less than | 4 < 5 | TRUE |
| <= | less than or equal to | 4 <= 5 | TRUE |
Mastering the meaning and use of these logical operators will go a long way to helping you in your data science journey!
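These operators are vectorized, so they compare every element of a vector at once, and they return NA wherever the input is NA. Here is a minimal sketch using a made-up vector (abundance_example is a name chosen for illustration).

```r
# A made-up vector of abundances with one missing value
abundance_example <- c(40.69, 0.07, NA, 3.97)

# One logical result per element; the NA stays NA
is_above_one <- abundance_example > 1
is_above_one                                     # TRUE FALSE NA TRUE

# Indexing with a logical vector keeps an NA placeholder
with_na <- abundance_example[is_above_one]       # 40.69 NA 3.97

# Wrapping the condition in which() keeps only the TRUE positions
without_na <- abundance_example[which(is_above_one)]   # 40.69 3.97
```

This is one reason to deal with NA values before filtering: a condition cannot be decided for a value that is missing.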
%in% syntax to compare sets
Sometimes the simplest kind of conditional can be thought of as comparing two sets of data. Which values in A exist in B? For example, we may want to keep all rows that have either Smithella OR Methanobacterium as their genus.
To accomplish this using basic functions in R, we turn to the binary match operator, %in%, which asks, for each element of x, whether it is present in y, using the syntax x %in% y. This operator returns a logical vector the same length as x, with TRUE wherever the element from x is found in y.
Let's see what that looks like in the context of our above question.
# Find out more about the match operator by using double quotes
# ?"%in%"
# What does %in% return?
str(microbes$genus %in% c("Smithella", "Methanobacterium"))
logi [1:6656] FALSE FALSE FALSE FALSE FALSE FALSE ...
# You can filter your data using basic R commands
# Use the conditional result to index our data.frame
head(microbes[microbes$genus %in% c("Smithella", "Methanobacterium"),])
# how many rows (entries) do we find with our query?
nrow(microbes[microbes$genus %in% c("Smithella", "Methanobacterium"),])
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.31 | compound-free | brackish | control | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.27 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.27 | compound-free | brackish | control | 2 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 9.98 | compound-free | brackish | control | 3 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 2.48 | compound-free | brackish | control | 3 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
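To see the behaviour of %in% in isolation, here is a minimal sketch on two small vectors. The values in `x` and `y` are made up for illustration and are not taken from the microbes data.

```r
# Toy vectors (hypothetical values, not from the microbes data)
x <- c("Smithella", "Pseudomonas", "Methanobacterium", "Smithella")
y <- c("Smithella", "Methanobacterium")

# One logical value per element of x: is that element found anywhere in y?
in_y <- x %in% y
print(in_y)

# The result has the same length as the left-hand side
length(in_y) == length(x)

# Swapping the operands asks a different question: is each element of y in x?
y_in_x <- y %in% x
print(y_in_x)

# Use the logical vector to keep only the matching elements
kept <- x[in_y]
print(kept)
```

Because the result lines up one-to-one with `x`, it can be dropped straight into the row position of `[ , ]`, which is exactly what the `microbes[...]` call above does.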
# A near-equivalent command using the logical OR
# This, however, is a cautionary example about filtering your data. Watch out for this command!
nrow(microbes[(microbes$genus == "Smithella" | microbes$genus == "Methanobacterium"),])
# Careful: where genus is NA the comparison is also NA, and indexing with NA returns an extra row filled with NA.
# These are the rows with a missing genus:
microbes[which(is.na(microbes$genus)),]
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.07 | compound-free | brackish | control | 1 | Bacteria | Planctomycetes | Planctomycetacia | Planctomycetales | Planctomycetaceae | NA | GAATTGGCGGGGGCTCACACAAGCGGTGGAGGATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCTAGACTTGACATGCTTAAGAATCTTCTGGAAACAGAGGAGTGCCCTTCGGGGAGCTTTTGCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCTTAACGAGCGAAACCCTTATCTTTAGTTGCCAGCGGGTAATGCCGGGGACTCTAAAGAGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTTTAGGGCTGCACACGTCCTACAATGGTACGTACAAAGGGAAGCAAAGTCGCGAGGCCAAGCTAATCCCAAAAAGCGTGCCTCAGTTCGGATTGTAGGCTGCAATTCGCCTACATGAAGCTGGAATCGCTAGTAATCGCGGGTCAGCATACCGCGGTGAATGTGTTCCTGAGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 1 | Bacteria | Planctomycetes | Phycisphaerae | Phycisphaerales | Phycisphaeraceae | NA | GAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTTGCTTAATTCGAGGCAACGCGAAGAACCTTACCTGGGTTTGACATGCATGGATCAACTCGGTGAAAGCCGAGCCACAGTCGCAAGACCGGAACATGCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCTGTGAAGTGTCGGGTTAAGTCCTCTAACGAGCGCAACCCCTGTCGTTAGTTGCTCACGGGTTATGCCGAGTACTCTAACGAGACTGCCGCTGTAAAGCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACATCCAGGGTTGCAAACGTGCTACAATGGCGCGTACAAAGCGAAGCGAGGCCGCGAGGCGGAGCAAATCGCAAAAAGCGCGCCCCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATCGCTAGTAATCGGAGATCAGCTACGCTCCGGTGAATGTGTTCCTGAGCCTTGTACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Actinobacteria | Actinobacteria | Frankiales | Sporichthyaceae | NA | GAATTGGCGGGGCCCCGCACAAGCAGCGGAGCATGCGGCTTAATTCGACGCAACGCGAAGAACCTTACCAAGGCTTGACATATACAGGAATATGGCAGAGATGTCATAGCCGCAAGGTCTGTATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTTCTGTGTTGCCAGCATTTAGTTGGGGACTCACAGGAGACTGCCGGGGTTAACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTGCACGCATGCTACAATGGCTGGTACAAACGGCTGCGATACCGCAAGGTGGAGCGAATCCGAAAAAGCCAGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTTGCTAGTAATCGTAGATCAGCAACGCTACGGTGAATACGTTCCCGGGGCTTGCACA |
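The difference between the two approaches can be reproduced on a small scale. The sketch below uses a hypothetical four-row data frame called `toy` (an assumption for illustration only) with one missing genus, and compares indexing by `==` combined with `|` against indexing by %in%.

```r
# Toy data frame with one missing genus (hypothetical values)
toy <- data.frame(
  genus = c("Smithella", NA, "Pseudomonas", "Methanobacterium"),
  abundance = c(0.27, 0.07, 40.69, 0.31),
  stringsAsFactors = FALSE
)
targets <- c("Smithella", "Methanobacterium")

# == propagates NA: the comparison for the missing genus is NA, not FALSE
by_or <- toy$genus == "Smithella" | toy$genus == "Methanobacterium"
print(by_or)

# %in% never returns NA: a missing genus simply does not match
by_in <- toy$genus %in% targets
print(by_in)

# Indexing with an NA produces an extra row filled with NA
n_or <- nrow(toy[by_or, ])   # 2 real matches + 1 all-NA row
n_in <- nrow(toy[by_in, ])   # 2 real matches only

# which() keeps only the positions that are TRUE, so it also drops the NA
n_which <- nrow(toy[which(by_or), ])

print(c(n_or = n_or, n_in = n_in, n_which = n_which))
```

If row counts feed into downstream summaries, the extra all-NA rows inflate them silently, which is why the %in% or `which()` forms are usually the safer choice for base R indexing.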
filter() function to replicate %in% and more!
From our query above we already know we were asking R to search the genus column of our data frame for any matches to Smithella OR Methanobacterium. The bracket notation, however, can be a little confusing, whereas the filter() function can accomplish the same task in a more human-readable syntax.
Using the filter() function we can evaluate each row against our criteria. Our first argument will be our data.frame, followed by the condition(s) a row must satisfy to be kept. Notably, filter() drops any row for which the condition evaluates to NA. Why is that important?
# But the syntax using filter is much more human readable
filter(microbes, genus == "Smithella" &
genus == "Methanobacterium")
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
Our code produced an empty tibble because we used the logical operator & (AND). To us it sounds natural to ask for Smithella AND Methanobacterium, but R evaluates the condition one row at a time, and a single row's genus can't be both Smithella AND Methanobacterium at once. That's why we need the | (OR) operator to select every row that is Smithella OR Methanobacterium. Here's a handy summary of the logical operators.
| Operator | Description | Use or Result |
|---|---|---|
| ! | Logical NOT | Converts each logical value into its opposite |
| & | Element-wise logical AND | Compares operands element by element; the result has the length of the longer operand |
| && | Logical AND | Compares two single (length-one) values and returns a single logical value; older R versions silently used only the first element of longer operands, while R 4.3.0 and later raise an error |
| \| | Element-wise logical OR | Compares operands element by element; the result has the length of the longer operand |
| \|\| | Logical OR | Same single-value behaviour as &&, but for OR |
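The operators in the table can be tried on two short logical vectors. This is a minimal sketch with made-up vectors `a` and `b`; the `&&` and `||` lines deliberately use single values only, since recent R versions (4.3.0 and later, as far as we are aware) raise an error when these operators are given longer vectors.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# Element-wise operators return one result per position
not_a   <- !a
a_and_b <- a & b
a_or_b  <- a | b
print(not_a)
print(a_and_b)
print(a_or_b)

# && and || are meant for single TRUE/FALSE values, e.g. inside if()
x <- 5
if (is.numeric(x) && x > 0) {
  message("x is a positive number")
}

# || stops evaluating as soon as the answer is known (short-circuit),
# so the stop() on the right-hand side is never run here
short_circuit <- TRUE || stop("never reached")
print(short_circuit)
```

Inside filter() and inside `[ ]` indexing we want one answer per row, so the element-wise forms & and | are the ones to use there.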
Now, let's try that filter() command again.
# Filter microbes using the proper logical operator
nrow(filter(microbes, genus == "Smithella" |
genus == "Methanobacterium"))
#Will this work?
nrow(filter(microbes, genus == c("Smithella", "Methanobacterium")))
What happened with our above command? Why did it return only 26 rows? To be honest, we were lucky that the operation ran at all! When R encounters an operation between vectors of different lengths, it will recycle the shorter vector when it can.
Here's an example:
c(1,2,3) + c(10,11)
Warning message in c(1, 2, 3) + c(10, 11): "longer object length is not a multiple of shorter object length"
In this case, R gave us a warning that our vectors don't match in length. It returned to us a vector of length 3 (our longest vector), and it recycled the 10 from the shorter vector to add to the 3.
However, R will assume that you know what you are doing as long as one of your vector lengths is a multiple of your other vector length. Here the shorter vector is recycled twice. No warning is given.
c(1,2,3,4) + c(10,11)
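To make the recycling visible, the sketch below stores both results and spells out the recycled vector by hand with `rep()`. The variable names are our own and only for illustration.

```r
# Lengths 4 and 2: the shorter vector is reused as 10, 11, 10, 11
even_fit <- c(1, 2, 3, 4) + c(10, 11)
print(even_fit)

# Writing the recycling out by hand gives the same result
by_hand <- c(1, 2, 3, 4) + rep(c(10, 11), times = 2)
identical(even_fit, by_hand)

# Lengths 3 and 2: the shorter vector is reused as 10, 11, 10.
# R warns here; suppressWarnings() is used only to keep the output tidy.
uneven_fit <- suppressWarnings(c(1, 2, 3) + c(10, 11))
print(uneven_fit)
```

The silent case is the dangerous one: no warning does not mean the comparison did what you intended.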
%in% instead of ==
Going back to our broken code,
nrow(filter(microbes, genus == c("Smithella", "Methanobacterium")))
while well-intentioned, was basically saying: filter for odd rows where genus == "Smithella" and even rows where genus == "Methanobacterium".
Recall that %in% is a binary match operator that asks: for each element in genus, does that element exist in the vector c("Smithella", "Methanobacterium")?
# Use the right operator to get the job done when filtering with vectors
nrow(filter(microbes, genus %in% c("Smithella", "Methanobacterium")))
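The odd-row/even-row effect is easy to check on a short vector. This sketch uses a hypothetical six-element `genus` vector (not the real column) and base R only, so the recycled comparison can be inspected position by position.

```r
# Hypothetical genus values, chosen so that recycling misses real matches
genus <- c("Smithella", "Smithella", "Methanobacterium",
           "Methanobacterium", "Pseudomonas", "Smithella")
targets <- c("Smithella", "Methanobacterium")

# == recycles targets down the vector: position 1 vs "Smithella",
# position 2 vs "Methanobacterium", position 3 vs "Smithella", and so on
recycled <- genus == targets
print(recycled)
n_recycled <- sum(recycled)

# %in% checks every element against the whole set of targets
matched <- genus %in% targets
print(matched)
n_matched <- sum(matched)

print(c(recycled = n_recycled, matched = n_matched))
```

Only the elements that happen to sit at the "right" position survive the `==` comparison, which is why the row count from the recycled version is smaller than the true number of matches.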
filter() to identify matching candidates with criteria across multiple variables
We just filtered for multiple genera (multiple rows based on the identity of values in a single column). However, you can also filter for rows based on values in multiple columns. We can do this from basic principles too, but this is where the filter() function really shines, as it keeps the query language clear for us and others to read and interpret.
For example, you can use the following filtering combinations:
# Query for samples of either genus Smithella or class Methanobacteria
head(filter(microbes,
genus == "Smithella" | class == "Methanobacteria"))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.31 | compound-free | brackish | control | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.27 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.27 | compound-free | brackish | control | 2 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 9.98 | compound-free | brackish | control | 3 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 2.48 | compound-free | brackish | control | 3 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
# == means "is exactly".
# Query for rows with abundance equal to 0 from only first replicates
head(filter(microbes, abundance == 0 & replicate == 1))
# The equivalent call is
head(filter(microbes,
abundance == 0, replicate == 1)) # Under the hood this is combined with logical &
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Epsilonproteobacteria | Campylobacterales | Campylobacteraceae | Campylobacter | GAATAGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGACTTGACATCCTAAAAACATCTAAGAGATTAGAAAGTGCTAGTTTACTAGAATTTAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCACGTGTTTAGTTGCTAACAGCTCGGCTGAGCACTCTAAACAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT |
| 0 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | GAATAGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGATTGAAATTTAGATGTTGGCAGATGAGAGTTTGCTTTCCTTTGGGACATCTAAGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCGCGTCGATAGTTACTAACAGGTTAAGCTGAGGACTCTATCGAGACAGCCGTCGTAAGACGTGAGGAAGGGGCGGATGACGTCAAATCAGCACGGCCCTTACATCCGGGGCGACACACGTGTTACAATGGCAGGGACAAAGGGAAGCGACATGGTGACATGAAGCGGATCTCCAAACCCTGTCCCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Xanthomonadales | Xanthomonadaceae | Arenimonas | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAGCTCTTGACATCTTCGGAACTTTCTAGAGATAGATTGGTGCCTTCGGGAACCGAATGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCACGTAATGGTGGGAACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTACTACAATGGTGGGGACAGAGGGTCGCAATGCCGCGAGGCGGAGCCAATCCCAGAAACCCTATCTTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Chromatiales | Ectothiorhodospiraceae | Ectothiorhodospira | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGCCCTTGACATCCTCGGAATCCTTCAGAGATGAGGGAGTGCCTTCGGGAACCGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCCTAGTTGCCAGCATTTCGGATGGGAACTCTAGGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTATGGGCAGGGCTACACACGTGCTACAATGGCCGGTACAGTGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAAAAAGCCGGTCGTAGTCCGAATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Eubacteriaceae | Acetobacterium | GAATAGGCGGGGACCCGCACAAGCAGCGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTCTGACAATCTGAGAGATCAGACTTTCCCTTCGGGGACAGAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTGGTTAGTTGCCATCATTTAGTTGGGCACTCTAAGCAGACTGCCGTGGATAACACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGTCTGAACAGAGGGTTGCGAAACCGCGAGGTGAAGCTAATCCCTTAAAACAGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTTGGAGTTGCTAGTAATCGCAGATCAGAATGCTGCGGTGAATGCGTTCCCGGGTCTTGCACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Epsilonproteobacteria | Campylobacterales | Campylobacteraceae | Arcobacter | GAATAGGCGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGCCTTGACATCCTTAGAATCTTTTAGAGATAAGAGAGTGCCTAGTTTACTAGGAGCTAAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATCATTAGTTGCTAACAGTTAGGCTGAGAACTCTAATGAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT |
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Epsilonproteobacteria | Campylobacterales | Campylobacteraceae | Campylobacter | GAATAGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGACTTGACATCCTAAAAACATCTAAGAGATTAGAAAGTGCTAGTTTACTAGAATTTAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCACGTGTTTAGTTGCTAACAGCTCGGCTGAGCACTCTAAACAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT |
| 0 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | GAATAGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGATTGAAATTTAGATGTTGGCAGATGAGAGTTTGCTTTCCTTTGGGACATCTAAGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCGCGTCGATAGTTACTAACAGGTTAAGCTGAGGACTCTATCGAGACAGCCGTCGTAAGACGTGAGGAAGGGGCGGATGACGTCAAATCAGCACGGCCCTTACATCCGGGGCGACACACGTGTTACAATGGCAGGGACAAAGGGAAGCGACATGGTGACATGAAGCGGATCTCCAAACCCTGTCCCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Xanthomonadales | Xanthomonadaceae | Arenimonas | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAGCTCTTGACATCTTCGGAACTTTCTAGAGATAGATTGGTGCCTTCGGGAACCGAATGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCACGTAATGGTGGGAACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTACTACAATGGTGGGGACAGAGGGTCGCAATGCCGCGAGGCGGAGCCAATCCCAGAAACCCTATCTTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Chromatiales | Ectothiorhodospiraceae | Ectothiorhodospira | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGCCCTTGACATCCTCGGAATCCTTCAGAGATGAGGGAGTGCCTTCGGGAACCGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCCTAGTTGCCAGCATTTCGGATGGGAACTCTAGGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTATGGGCAGGGCTACACACGTGCTACAATGGCCGGTACAGTGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAAAAAGCCGGTCGTAGTCCGAATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Eubacteriaceae | Acetobacterium | GAATAGGCGGGGACCCGCACAAGCAGCGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTCTGACAATCTGAGAGATCAGACTTTCCCTTCGGGGACAGAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTGGTTAGTTGCCATCATTTAGTTGGGCACTCTAAGCAGACTGCCGTGGATAACACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGTCTGAACAGAGGGTTGCGAAACCGCGAGGTGAAGCTAATCCCTTAAAACAGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTTGGAGTTGCTAGTAATCGCAGATCAGAATGCTGCGGTGAATGCGTTCCCGGGTCTTGCACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Epsilonproteobacteria | Campylobacterales | Campylobacteraceae | Arcobacter | GAATAGGCGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGCCTTGACATCCTTAGAATCTTTTAGAGATAAGAGAGTGCCTAGTTTACTAGGAGCTAAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATCATTAGTTGCTAACAGTTAGGCTGAGAACTCTAATGAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT |
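The equivalence of the two calls, and the way filter() treats missing values, can be checked on a small table. The sketch below assumes the dplyr package is installed and uses a hypothetical five-row tibble `toy`, not the microbes data.

```r
library(dplyr)

# Hypothetical data with one missing abundance value
toy <- tibble(
  abundance = c(0, 0, 1.5, NA, 0),
  replicate = c(1, 2, 1, 1, 1)
)

# Conditions joined with & ...
with_and <- filter(toy, abundance == 0 & replicate == 1)

# ... and the same conditions passed as separate arguments
with_comma <- filter(toy, abundance == 0, replicate == 1)

# Both forms keep the same rows
same_result <- isTRUE(all.equal(with_and, with_comma))
print(same_result)

# The row with NA abundance is dropped, because NA == 0 is NA rather than TRUE
print(with_and)
n_kept <- nrow(with_and)
```

Dropping rows whose condition is NA is usually what we want when subsetting, but it also means missing data disappears quietly, so it is worth counting NAs separately before filtering.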
# != means "is not"
# Query for microbes with abundance not equal to 0 from replicate 1 data
head(filter(microbes,
abundance != 0 & replicate == 1))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
# >= means "greater than or equal to"
# Query for microbes where abundance is greater than or equal to 0 from replicate 1 data (see above)
head(filter(microbes,
abundance >= 0 & replicate == 1))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
# <= means "less than or equal to"
# Query for microbes where abundance is less than or equal to 0 from replicate 1 data
head(filter(microbes,
abundance <= 0 & replicate == 1))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Epsilonproteobacteria | Campylobacterales | Campylobacteraceae | Campylobacter | GAATAGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGACTTGACATCCTAAAAACATCTAAGAGATTAGAAAGTGCTAGTTTACTAGAATTTAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCACGTGTTTAGTTGCTAACAGCTCGGCTGAGCACTCTAAACAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT |
| 0 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | GAATAGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGATTGAAATTTAGATGTTGGCAGATGAGAGTTTGCTTTCCTTTGGGACATCTAAGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCGCGTCGATAGTTACTAACAGGTTAAGCTGAGGACTCTATCGAGACAGCCGTCGTAAGACGTGAGGAAGGGGCGGATGACGTCAAATCAGCACGGCCCTTACATCCGGGGCGACACACGTGTTACAATGGCAGGGACAAAGGGAAGCGACATGGTGACATGAAGCGGATCTCCAAACCCTGTCCCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Xanthomonadales | Xanthomonadaceae | Arenimonas | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAGCTCTTGACATCTTCGGAACTTTCTAGAGATAGATTGGTGCCTTCGGGAACCGAATGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCACGTAATGGTGGGAACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTACTACAATGGTGGGGACAGAGGGTCGCAATGCCGCGAGGCGGAGCCAATCCCAGAAACCCTATCTTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Chromatiales | Ectothiorhodospiraceae | Ectothiorhodospira | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGCCCTTGACATCCTCGGAATCCTTCAGAGATGAGGGAGTGCCTTCGGGAACCGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCCTAGTTGCCAGCATTTCGGATGGGAACTCTAGGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTATGGGCAGGGCTACACACGTGCTACAATGGCCGGTACAGTGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAAAAAGCCGGTCGTAGTCCGAATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Eubacteriaceae | Acetobacterium | GAATAGGCGGGGACCCGCACAAGCAGCGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTCTGACAATCTGAGAGATCAGACTTTCCCTTCGGGGACAGAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTGGTTAGTTGCCATCATTTAGTTGGGCACTCTAAGCAGACTGCCGTGGATAACACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGTCTGAACAGAGGGTTGCGAAACCGCGAGGTGAAGCTAATCCCTTAAAACAGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTTGGAGTTGCTAGTAATCGCAGATCAGAATGCTGCGGTGAATGCGTTCCCGGGTCTTGCACA |
| 0 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Epsilonproteobacteria | Campylobacterales | Campylobacteraceae | Arcobacter | GAATAGGCGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGCCTTGACATCCTTAGAATCTTTTAGAGATAAGAGAGTGCCTAGTTTACTAGGAGCTAAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATCATTAGTTGCTAACAGTTAGGCTGAGAACTCTAATGAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT |
# > means "greater than"
# Query for microbes where abundance is above 0 from replicate 1 data
head(filter(microbes,
abundance > 0 & replicate == 1))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
# Query microbes for those in the proper genus in the first replicate
head(filter(microbes,
genus %in% c("Smithella", "Methanobacterium") & replicate == 1))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.31 | compound-free | brackish | control | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.27 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 6.69 | pyrene | brackish | treatment | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 4.31 | pyrene | brackish | treatment | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 3.52 | toluene | brackish | treatment | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 1.05 | toluene | brackish | treatment | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
# < means "less than"
# Query microbes for any instances of abundance <0 in the first replicate.
head(filter(microbes, abundance < 0 & replicate == 1))
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
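The comparison operators used in these queries each return a logical vector (TRUE or FALSE for every element), and filter() keeps only the rows where the result is TRUE. As a minimal sketch, the vectors below are small made-up stand-ins (not the real microbes data) showing what each operator returns on its own. The last query above comes back empty for the same reason the `< 0` test below is all FALSE: no abundance is negative.

```r
# Hypothetical values for illustration only; not taken from the microbes file
abundance <- c(0, 0.27, 3.90, 40.69)
genus <- c("Arcobacter", "Smithella", "Tessaracoccus", "Pseudomonas")

# Each comparison returns one TRUE/FALSE per element
at_most_zero <- abundance <= 0   # less than or equal to
above_zero   <- abundance > 0    # greater than
below_zero   <- abundance < 0    # less than

# %in% asks whether each element appears in a set of allowed values
wanted_genus <- genus %in% c("Smithella", "Methanobacterium")

# & keeps only positions where both conditions are TRUE
keep <- above_zero & wanted_genus
genus[keep]
```

Printing `at_most_zero`, `above_zero`, or `wanted_genus` is a quick way to check a condition before handing it to filter().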
A powerful set of tools called regular expressions (regex) can also be used for partial matching. Regex are supported in most programming languages, not only in R, so familiarizing yourself with them is a must as a programmer.
We will spend a large chunk of lecture 05 discussing regular expressions. Until then, just remember that you can use them as part of your filtering process. Below you'll find some useful functions that can help you accomplish this.
# More about regex
?regex()
# search for matches to argument pattern
?grep()
?grepl()
?regexpr()
?gregexpr()
?regexec()
# perform replacement of the first and all matches respectively.
?sub()
?gsub()
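To give a flavour of how these functions fit into filtering before lecture 05, here is a minimal sketch. It assumes dplyr is installed, and it uses a small made-up data frame (`toy_microbes`, a hypothetical name) in place of the real microbes data. grepl() returns TRUE or FALSE for each element, which is the kind of input filter() expects.

```r
library(dplyr)

# Hypothetical toy data standing in for the microbes data frame
toy_microbes <- data.frame(
  abundance = c(0.31, 0.27, 24.57, 40.69),
  genus = c("Methanobacterium", "Smithella", "Methanosaeta", "Pseudomonas")
)

# Keep rows whose genus starts with "Methano" (^ anchors the match to the start)
methano <- filter(toy_microbes, grepl("^Methano", genus))
methano

# sub() replaces only the first match, gsub() replaces all matches
first_only <- sub("a", "A", "Methanosaeta")
all_matches <- gsub("a", "A", "Methanosaeta")
```

Here `first_only` is "MethAnosaeta" and `all_matches` is "MethAnosAetA".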
select() to subset and order columns in your data frame
You can subset columns by using the select() function. You can also reorder columns using this function. Essentially this is a great way to move columns around your data frame or select() for the data columns you want in your data frame.
The select() function takes the format of select(data, ...) where
data is your data.frame or tibble object.
... is a comma-separated list of columns from data.
Suppose I want to compare the abundance of species across all treatments, but I want genus in the last column.
# We just want to know abundance, compound, and genus
head(select(microbes, abundance, compound, genus))
| abundance | compound | genus |
|---|---|---|
| <dbl> | <chr> | <chr> |
| 40.69 | compound-free | Pseudomonas |
| 11.71 | compound-free | Candidatus |
| 11.13 | compound-free | Lachnoclostridium |
| 6.14 | compound-free | Trichococcus |
| 3.97 | compound-free | Proteiniphilum |
| 3.90 | compound-free | Tessaracoccus |
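Because select() returns columns in the order you list them, it can also move a column without retyping every other name. The sketch below assumes dplyr is installed and uses a made-up data frame (`toy_microbes`, a hypothetical name) rather than the real microbes data. The everything() helper stands for "all remaining columns".

```r
library(dplyr)

# Hypothetical toy data standing in for the microbes data frame
toy_microbes <- data.frame(
  abundance = c(40.69, 11.71),
  compound = c("compound-free", "compound-free"),
  genus = c("Pseudomonas", "Candidatus"),
  replicate = c(1, 1)
)

# Move genus to the front; everything() fills in the remaining columns
genus_first <- select(toy_microbes, genus, everything())
names(genus_first)

# Move genus to the end: drop it first, then add it back
genus_last <- select(toy_microbes, -genus, genus)
names(genus_last)
```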
starts_with() and ends_with() helper functions to specify elements from a vector
dplyr also includes some helper functions that allow you to select variables (columns) based on their names. For example, if we had multiple carbon sources listed in "compound1" and "compound2" we could select for columns that started with "com" using the starts_with() function.
# Select for columns whose names start with "com"
head(select(microbes, abundance, starts_with("com"), genus))
| abundance | compound | genus |
|---|---|---|
| <dbl> | <chr> | <chr> |
| 40.69 | compound-free | Pseudomonas |
| 11.71 | compound-free | Candidatus |
| 11.13 | compound-free | Lachnoclostridium |
| 6.14 | compound-free | Trichococcus |
| 3.97 | compound-free | Proteiniphilum |
| 3.90 | compound-free | Tessaracoccus |
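The microbes data has only one column starting with "com", so the helper's benefit is easier to see on the hypothetical case described above. This sketch assumes dplyr is installed and invents a data frame (`toy_sources`) with "compound1" and "compound2" columns. Neither the columns nor the values come from the real dataset.

```r
library(dplyr)

# Hypothetical data frame with two carbon-source columns
toy_sources <- data.frame(
  abundance = c(6.69, 3.52),
  compound1 = c("pyrene", "toluene"),
  compound2 = c("toluene", "pyrene"),
  genus = c("Smithella", "Smithella")
)

# starts_with("com") picks up both compound columns at once
compound_cols <- select(toy_sources, abundance, starts_with("com"), genus)
names(compound_cols)

# ends_with("2") picks only the second carbon-source column
second_source <- select(toy_sources, ends_with("2"))
names(second_source)
```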
# Select a variable by the last three letters of its name with `ends_with()`.
head(select(microbes, ends_with("und")))
| compound |
|---|
| <chr> |
| compound-free |
| compound-free |
| compound-free |
| compound-free |
| compound-free |
| compound-free |
Check out select in the help menu. Grab all of the variables that contain the letter "u". Retain genus either as rownames or in a column.
# Explicitly take genus and then add any variables that have a "u" in them
head(select(microbes, genus, contains("u")))
head(select(microbes, genus, matches("u")))
| genus | abundance | compound | group | phylum |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <chr> | <chr> |
| Pseudomonas | 40.69 | compound-free | control | Proteobacteria |
| Candidatus | 11.71 | compound-free | control | Proteobacteria |
| Lachnoclostridium | 11.13 | compound-free | control | Firmicutes |
| Trichococcus | 6.14 | compound-free | control | Firmicutes |
| Proteiniphilum | 3.97 | compound-free | control | Bacteroidetes |
| Tessaracoccus | 3.90 | compound-free | control | Actinobacteria |
| genus | abundance | compound | group | phylum |
|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <chr> | <chr> |
| Pseudomonas | 40.69 | compound-free | control | Proteobacteria |
| Candidatus | 11.71 | compound-free | control | Proteobacteria |
| Lachnoclostridium | 11.13 | compound-free | control | Firmicutes |
| Trichococcus | 6.14 | compound-free | control | Firmicutes |
| Proteiniphilum | 3.97 | compound-free | control | Bacteroidetes |
| Tessaracoccus | 3.90 | compound-free | control | Actinobacteria |
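The two calls give the same result here because "u" means the same thing as a literal string and as a regular expression. The helpers differ once the pattern uses regex syntax: contains() looks for the literal text, while matches() interprets it as a regular expression. A minimal sketch, assuming dplyr is installed and using a made-up data frame (`toy_cols`) whose values are placeholders:

```r
library(dplyr)

# Hypothetical data frame; only the column names matter for this example
toy_cols <- data.frame(
  genus = "Smithella",
  abundance = 0.27,
  compound = "compound-free",
  group = "control",
  phylum = "Proteobacteria",
  class = "Deltaproteobacteria"
)

# Literal match: every column name containing the letter "u"
with_u <- select(toy_cols, contains("u"))
names(with_u)

# Regex match: column names ending in "um" or "us"
ends_um_us <- select(toy_cols, matches("u[ms]$"))
names(ends_um_us)

# The same pattern taken literally matches nothing
literal_pattern <- select(toy_cols, contains("u[ms]$"))
ncol(literal_pattern)
```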
arrange()
The arrange(data, ...) function helps you to sort your data. By default, rows are ordered from smallest to largest (or a-z for character data). You can switch the order by specifying desc() (descending) as shown below. You can think of this like sorting in Excel, and you can sort by giving precedence to multiple columns.
# Sort microbes by descending value of abundance
descending_abundance <- arrange(microbes, desc(abundance))
head(descending_abundance)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 29.58 | toluene | brackish | treatment | 2 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 25.44 | toluene | brackish | treatment | 2 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 24.57 | compound-free | brackish | control | 3 | Archaea | Euryarchaeota | Methanomicrobia | Methanosarcinales | Methanosaetaceae | Methanosaeta | GAATTGGCGGGGGAGCACCACAACGGGTGGAGCTTGCGGTTTAATTGGATTCAACGCCGGAAATCTTACCGGGACCGACAGCAATATGAAGGCCAGGCTGAAGACTTTGCCGGATTAGCTGAGAGGTGGTGCATGGCCGTCGTCAGTTCGTACTGTGAAGCATCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCACAGTTGCCAGCGTACTCTCTGGAGTGACGGGTACACTGTGGGGACCGCCGCTGCTAAAGCGGAGGAAGGAATGGGCAACGGTAGGTCAGTATGCCCCGAATATCCCGGGCTACACGCGAGCTACAATGGTTGGTACAATGGGTATCTACCCCGAAAGGGGACGGGAATCTCCTAAAACCAATCTTAGTTCGGATTGAGGGCTGCAACTCGCCCTCATGAAGCTGGAATCCGTAGTAATCGCGTTTCAACAGAACGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 24.31 | compound-free | brackish | control | 2 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 23.90 | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
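Sorting with precedence across several columns works by listing them in order: later columns only break ties in earlier ones. The sketch below assumes dplyr is installed and uses a small made-up data frame (`toy_microbes`, a hypothetical name). It sorts alphabetically by compound, then by descending abundance within each compound.

```r
library(dplyr)

# Hypothetical toy data standing in for the microbes data frame
toy_microbes <- data.frame(
  compound = c("toluene", "pyrene", "toluene", "pyrene"),
  genus = c("Smithella", "Smithella", "Methanobacterium", "Methanobacterium"),
  abundance = c(3.52, 6.69, 1.05, 4.31)
)

# First key: compound (a-z); second key: abundance (largest first)
sorted <- arrange(toy_microbes, compound, desc(abundance))
sorted
```

Swapping the order of the two arguments would instead sort the whole table by abundance and use compound only to break ties.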
Let's say we want to look at the genera above 10% relative sequence abundance in samples that contained toluene as the carbon source. We want to arrange our data frame in descending order of abundance.
How would you do it? How many unique taxa do we have?
# Use descending abundance and filter for our criteria
filtered_toluene_descending_abundance <- filter(descending_abundance,
abundance > 10 &
compound == "toluene") # Extra var 1
# Select just for the genus column
select_genera <- select(filtered_toluene_descending_abundance, genus) # Extra var 2
# How many of those are unique?
unique_genera <- unique(select_genera) # Extra var 3
# Print the result
unique_genera
| genus |
|---|
| <chr> |
| Lachnoclostridium |
| Candidatus |
| Methanosaeta |
| Pseudomonas |
| Smithella |
| Proteiniphilum |
| tolueneylobacterium |
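Printing the table lists the unique genera, but the question also asks how many there are. One way to get the number itself is sketched below. It assumes dplyr is installed and uses a made-up data frame (`toy_hits`, a hypothetical name) rather than the filtered microbes data.

```r
library(dplyr)

# Hypothetical filtered result with one repeated genus
toy_hits <- data.frame(
  genus = c("Lachnoclostridium", "Candidatus", "Lachnoclostridium", "Pseudomonas"),
  abundance = c(29.58, 25.44, 18.20, 12.75)
)

# Same approach as above: select the column, then keep unique rows
unique_genera <- unique(select(toy_hits, genus))

# Count the rows of the result
n_unique <- nrow(unique_genera)

# dplyr's n_distinct() counts unique values of a vector directly
n_unique_direct <- n_distinct(toy_hits$genus)
```

Both `n_unique` and `n_unique_direct` are 3 for this toy data.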
%>%
While the above code answered the question, it also created a bunch of new variables that we aren't interested in. These 'intermediate variables' were used to store data that got passed as input to the next function. This would quickly clutter our global environment if this were our strategy for data analysis. Instead, we can use a more "natural flow" of data to produce our code.
The dplyr package, and some other common packages for data frame manipulation, allow the use of the pipe operator, %>%. This is equivalent to | for any UNIX aficionados. Piping allows the output of a function to be passed to the next function without creating intermediate variables. Piping can save typing, make your code more readable, and reduce clutter in your global environment from variables you don't need. In RStudio, the keyboard shortcut for %>% is CTRL+SHIFT+M.
We are going to see how pipes work in conjunction with the function filter(), and then see the benefits to simplifying the code that we just wrote.
# Remember that R evaluates nested () from the innermost to the outermost
head(filter(microbes, genus == "Smithella" | genus == "Methanobacterium"))
#equivalent to
microbes %>% filter(genus == "Smithella" | genus == "Methanobacterium") %>% head()
# It really reduces nested parentheses!
#equivalent to
microbes %>%
# note there's a period in the first position.
filter(., genus == "Smithella" | genus == "Methanobacterium") %>% head()
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.31 | compound-free | brackish | control | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.27 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.27 | compound-free | brackish | control | 2 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 9.98 | compound-free | brackish | control | 3 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 2.48 | compound-free | brackish | control | 3 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.31 | compound-free | brackish | control | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.27 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.27 | compound-free | brackish | control | 2 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 9.98 | compound-free | brackish | control | 3 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 2.48 | compound-free | brackish | control | 3 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 0.31 | compound-free | brackish | control | 1 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.27 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 0.27 | compound-free | brackish | control | 2 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
| 0.00 | compound-free | brackish | control | 2 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 9.98 | compound-free | brackish | control | 3 | Bacteria | Proteobacteria | Deltaproteobacteria | Syntrophobacterales | Syntrophaceae | Smithella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTAGGCTTGACATCCCTGGAATTCCGTGGAAACACGGAAGTGCCTTTCGGGGAACCAGGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAATTGCCAGCATTCAGTTGGGCACTTTAAAGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCTTAGGGCTACACACGTGCTACAATGGGTGGTACAAAGAGAAGCCAACTCGCGAGAGCGCGCAAATCTCAAAAAGCCATCCTCAGTTCGGATTGGAGTCTGCAACCCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 2.48 | compound-free | brackish | control | 3 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobacterium | GAATTGGCGGGGGAGCACCACAACGCGTGGAGCCTGCGGTTTAATTGGATTCAACGCCGGACATCTCACCAGGGGCGACAGCAGAATGATAGCCAGGTTGATGACCTTGCTTGACAAGCTGAGAGGAGGTGCATGGCCGCCGTCAGCTCGTACCGTGAGGCGTCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCTTAGTTACCAGCGGATCCTTCGGGATGCCGGGCACACTAAGGGGACCGCCAGTGATAAACTGGAGGAAGGAGTGGACGACGGTAGGTCCGTATGCCCCGAATCCCCTGGGCTACACGCGGGCTACAATGGCTAGGACAATGGGTTCCGACACTGAAAAGTGAAGGTAATCTCCTAAACCTAGCCTTAGTTCGGATTGAGGGCTGTAACTCGCCCTCATGAAGCTGGAATGCGTAGTAATCGCGTGTCATAACCGCGCGGTGAATACGTCCCTGCTCCTTGCACA |
. with %>% denotes the object produced by the last called function
You'll notice that when piping, we do not explicitly write the first argument (our data frame) to filter(); instead, %>% passes it in for us. The dot . can be used as a placeholder to fill in that first argument explicitly. This notation is useful for nested functions (functions inside functions), which we will come across a bit later.
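As a small illustration of the placeholder, the sketch below uses a made-up three-row tibble (`toy` and its values are assumptions, not the course data). It shows that the implicit and explicit forms give the same result, and that `.` can also send the piped object to an argument other than the first.

```r
library(dplyr)

# Hypothetical toy data standing in for the course dataset
toy <- tibble(genus = c("A", "B", "C"),
              abundance = c(12, 3, 25))

# Implicit first argument: the pipe supplies `toy` to filter()
implicit <- toy %>% filter(abundance > 10)

# Explicit placeholder: `.` stands for the object coming through the pipe
explicit <- toy %>% filter(., abundance > 10)

# Both forms do the same thing as filter(toy, abundance > 10)
identical(implicit, explicit)

# The placeholder can also route the piped object to a later argument
label <- "abundance" %>% paste("mean", ., sep = "_")
label   # "mean_abundance"
```

In everyday dplyr code the implicit form is usually enough; the explicit dot mostly matters when the piped object is not the first argument.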
What would working with pipes look like for our more complex example? Arrange abundance in descending order, then keep the genera with a relative sequence abundance above 10% in treatments amended with toluene.
# Work our data based on what we want
microbes %>% arrange(desc(abundance)) %>% filter(abundance > 10 & compound == "toluene") %>% select(genus) %>% unique()
| genus |
|---|
| <chr> |
| Lachnoclostridium |
| Candidatus |
| Methanosaeta |
| Pseudomonas |
| Smithella |
| Proteiniphilum |
| tolueneylobacterium |
When using more than 2 pipes %>%, the code gets hard to follow for a reader (or yourself). Starting a new line after each pipe allows a reader to easily see which function is operating and makes it easier to follow your logic. Using pipes also has the benefit that extra intermediate variables do not need to be created, freeing up your global environment for objects you are interested in keeping.
For this example we've indented the subsequent commands in the pipeline to additionally separate things visually.
# Pass our data.frame
microbes %>%
    # Arrange by abundance
    arrange(desc(abundance)) %>%
    # Filter for abundance and compound
    filter(abundance > 10 & compound == "toluene") %>%
    # Retain just the genus column
    select(genus) %>%
    # retrieve unique values
    unique()
| genus |
|---|
| <chr> |
| Lachnoclostridium |
| Candidatus |
| Methanosaeta |
| Pseudomonas |
| Smithella |
| Proteiniphilum |
| tolueneylobacterium |
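To make the point about intermediate variables concrete, here is a minimal sketch on a made-up tibble (`toy`, `step1` to `step3`, and all values are hypothetical). It runs the same four steps once with named intermediate objects and once as a pipeline, then checks that the results match.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy <- tibble(genus     = c("A", "B", "A", "C"),
              compound  = c("toluene", "toluene", "toluene", "pyrene"),
              abundance = c(12, 3, 25, 40))

# Without pipes: every step leaves an extra object in the global environment
step1   <- arrange(toy, desc(abundance))
step2   <- filter(step1, abundance > 10 & compound == "toluene")
step3   <- select(step2, genus)
no_pipe <- unique(step3)

# With pipes: same steps, no intermediate objects
piped <- toy %>%
    arrange(desc(abundance)) %>%
    filter(abundance > 10 & compound == "toluene") %>%
    select(genus) %>%
    unique()

# Both approaches give the same one-row result (genus "A")
identical(no_pipe, piped)
```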
summarise()
We can use summarise(data, ...) to define and retrieve summarised information about our dataset in a simplified way. This essentially creates a new data.frame object summarizing our observations based on the functions supplied. Multiple functions and their results can be placed into new columns we name.
Let's generate some values based on the abundance column of microbes.
# Summarise abundance for mean and standard deviation of all rows combined
summarise(microbes,
abundance_mean = mean(abundance),
abundance_sd = sd(abundance))
| abundance_mean | abundance_sd |
|---|---|
| <dbl> | <dbl> |
| NA | NA |
Uh oh! Remember that a number of functions can be told to ignore NA values when calculating their results. You'll have to check their argument information to be sure; for instance, ?mean shows the na.rm argument that solves our problem.
# Summarise for mean and sd but make sure we ignore NA values
summarise(microbes,
abundance_mean = mean(abundance, na.rm = TRUE),
abundance_sd = sd(abundance, na.rm = TRUE))
| abundance_mean | abundance_sd |
|---|---|
| <dbl> | <dbl> |
| 0.3204917 | 1.732993 |
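The same behaviour is easy to check on a tiny vector. This is only a sketch with made-up numbers (`x` is hypothetical), but it shows why a single missing value turns the whole summary into NA.

```r
# A small vector with one missing value (made-up numbers)
x <- c(0.5, NA, 1.5)

# A single NA propagates through the calculation
mean_with_na <- mean(x)
mean_with_na      # NA

# na.rm = TRUE drops the NA before calculating
mean_without_na <- mean(x, na.rm = TRUE)
mean_without_na   # 1

sd_without_na <- sd(x, na.rm = TRUE)
sd_without_na     # sd of 0.5 and 1.5, about 0.707

# Counting the missing values tells you how much data was ignored
n_missing <- sum(is.na(x))
n_missing         # 1
```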
group_by() to group data based on variable categories
Does the analysis from above really make sense? No. These microbial cultures were grown under different conditions (compounds, treatments/controls, salinities, etc.), so we should be taking more variables into consideration. First, let's get means by compound using group_by() along with summarise().
Note that using group_by() produces a grouped data.frame object which behaves mostly like a standard data.frame but also has meta information about the grouping you've specified. This meta information can be used by other dplyr methods such as summarise()!
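If you want to look at that grouping meta information yourself, dplyr has a few helper functions for it. The sketch below uses a made-up tibble (`toy` and its values are assumptions) and is only meant to show that grouping changes the metadata, not the rows or columns.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy <- tibble(compound  = c("toluene", "toluene", "pyrene"),
              abundance = c(2, 4, 10))

grouped <- toy %>% group_by(compound)

# Inspect the grouping meta information
is_grouped_df(grouped)   # TRUE
group_vars(grouped)      # "compound"
n_groups(grouped)        # 2

# The data itself has the same shape as before grouping
dim(grouped)             # 3 rows, 2 columns

# summarise() uses the grouping: one row per compound
by_compound <- grouped %>% summarise(abundance_mean = mean(abundance))
by_compound              # pyrene 10, toluene 3
```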
# Pass along microbes
microbes %>%
# group by compound
group_by(., compound) %>%
# Look at the first 10 rows
head(., 10)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
| 2.88 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAATTGCAGAGGAACATAGTTGAAAGATTATGGCCGCAAGGTCTCTGTGAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTTATCATTAGTTACTAACAGGTCATGCTGAGGACTCTAGTGAGACTGCCGTCGTAAGATGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCTACACACGTGTTACAATGGGGGGTACAGAGGGCAGCTACCGGGCGACCGGATGCCAATCCCAAAAACCTCTCTCAGTTCGGATCGAAGTCTGCAACCCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 2.54 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Shewanellaceae | Shewanella | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCTCAGAAGCCAGCGGAGACGCAGGTGTGCCTTCGGGAACTGAGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATCCTTACTTGCCAGCGGGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCGGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCTCATAAAGCCGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 1.22 | compound-free | brackish | control | 1 | Archaea | Thaumarchaeota | Marine.Group.I | Unknown.Order | Unknown.Family | Candidatus | GAATTGGCGGGGGAGCACCACAAGGGGTGAAGCCTGCGGTTCAATTGGAGTCAACGCCAGAAATCTTACCCGGAGAGACAGCAGAATGAAGGTCAAGCTGAAGACTTTACCAGACAAGCTGAGAGGTGGTGCATGGCCGTCGCCAGCTCGTGCCGTGAGATGTCCTGTTAAGTCAGGTAACGAGCGAGATCCCTGCCTCTAGTTGCCACCATTACTCTCAGGAGTAGTGGGGCGAATTAGCGGGACCGCCGTAGTTAATACGGAGGAAGGAAGGGGCCACGGCAGGTCAGTATGCCCCGAAACTCTGGGGCCACACGCGGGCTGCAATGGTAACGACAATGTGTTCCGAATCCGAAAGGAAGAGGTAATCCAGAAACGTTACCACAGTTATGACTGAGGGCTGCAACTCGCCCTCACGAATATGGAATCCCTAGTAACGGCGTGTCATTATCGCGCCGTGAATACGTCCCTGCTCCTTGCACA |
| 1.02 | compound-free | brackish | control | 1 | Archaea | Euryarchaeota | Methanomicrobia | Methanosarcinales | Methanosaetaceae | Methanosaeta | GAATTGGCGGGGGAGCACCACAACGGGTGGAGCTTGCGGTTTAATTGGATTCAACGCCGGAAATCTTACCGGGACCGACAGCAATATGAAGGCCAGGCTGAAGACTTTGCCGGATTAGCTGAGAGGTGGTGCATGGCCGTCGTCAGTTCGTACTGTGAAGCATCCTGTTAAGTCAGGCAACGAGCGAGACCCACGCCCACAGTTGCCAGCGTACTCTCTGGAGTGACGGGTACACTGTGGGGACCGCCGCTGCTAAAGCGGAGGAAGGAATGGGCAACGGTAGGTCAGTATGCCCCGAATATCCCGGGCTACACGCGAGCTACAATGGTTGGTACAATGGGTATCTACCCCGAAAGGGGACGGGAATCTCCTAAAACCAATCTTAGTTCGGATTGAGGGCTGCAACTCGCCCTCATGAAGCTGGAATCCGTAGTAATCGCGTTTCAACAGAACGCGGTGAATACGTCCCTGCTCCTTGCACA |
# Now summarise the grouped data instead of just peeking at it
# Pass along microbes
microbes %>%
    # group by compound
    group_by(., compound) %>%
    # Summarise our grouped data
    summarise(abundance_mean = mean(abundance, na.rm = TRUE),
              abundance_sd = sd(abundance, na.rm = TRUE))
| compound | abundance_mean | abundance_sd |
|---|---|---|
| <chr> | <dbl> | <dbl> |
| compound-free | 0.3185560 | 1.967123 |
| pyrene | 0.3247483 | 1.607803 |
| toluene | 0.3179514 | 1.629708 |
Notice that summarise() created a new tibble with the columns abundance_mean and abundance_sd. You can name these columns whatever you want. We also see the column compound, which we used in the group_by() command. Any columns used in that command will also be included since they are the foundation of the summarise() call.
# Here's the equivalent code without piping
summarise(group_by(microbes, compound),
abundance_mean = mean(abundance, na.rm=TRUE),
abundance_sd = sd(abundance, na.rm=TRUE))
Which option looks more "readable" to you?
mutate() to create new columns in your data frame
Speaking of creating columns, let's explore the mutate() function. mutate() creates new columns, most often as the product of a calculation. For example, let's concatenate values from some of the columns by putting the compound and salinity columns together, and likewise the family and genus columns.
# Let's just re-initialize microbes here
microbes <- read_csv(file = "data/microbes.csv",
col_names = TRUE, col_types = cols())
# Start with our data.frame
microbes %>%
# Use the mutate command to combine
mutate(compound_salinity = paste(compound, salinity, sep = "_"), # compound with salinity
family_genus = paste(family, genus, sep = "_")) %>% # and family with genus
# Peek at our result
head()
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV | compound_salinity | family_genus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA | compound-free_brackish | Pseudomonadaceae_Pseudomonas |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA | compound-free_brackish | Rhodospirillaceae_Candidatus |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA | compound-free_brackish | Lachnospiraceae_Lachnoclostridium |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA | compound-free_brackish | Carnobacteriaceae_Trichococcus |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA | compound-free_brackish | Porphyromonadaceae_Proteiniphilum |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA | compound-free_brackish | Propionibacteriaceae_Tessaracoccus |
Up to this point we've been doing a lot of piping with %>%, and we can see the results in the output of our code, but we have NOT been saving the results to a variable. This has two consequences:
The original microbes object is unchanged by any of the pipelines above.
If you want to save your data - perhaps after figuring out the series of steps you want to implement - you need to assign it to a variable or at least pipe it to a write*() function to save it to disk.
In contrast to an unassigned mutate() call, we can also alter our data structure directly by adding new columns. New columns can be easily created using the $col_name syntax. If the column does not already exist, it will be created. Otherwise its data will be overwritten.
# adding columns can also be done using "base R" code:
# This will permanently change microbes
microbes$compound_salinity <- paste(microbes$compound,
                                    microbes$salinity,
                                    sep = "_")
head(microbes)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV | compound_salinity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA | compound-free_brackish |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA | compound-free_brackish |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA | compound-free_brackish |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA | compound-free_brackish |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA | compound-free_brackish |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA | compound-free_brackish |
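The difference between printing a mutate() result and keeping it can be checked directly. This is a minimal sketch on a made-up tibble (`toy` and `toy_labelled` are hypothetical names): the first pipeline only displays the new column, while the second assigns the result so the column is kept.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy <- tibble(compound = c("toluene", "pyrene"),
              salinity = c("fresh", "saline"))

# Printed, but not saved: `toy` itself is untouched
toy %>% mutate(compound_salinity = paste(compound, salinity, sep = "_"))
ncol(toy)            # still 2

# Assigned: the new column lives on in `toy_labelled`
toy_labelled <- toy %>%
    mutate(compound_salinity = paste(compound, salinity, sep = "_"))
ncol(toy_labelled)   # 3
toy_labelled$compound_salinity   # "toluene_fresh" "pyrene_saline"
```

Assigning to a new name, rather than overwriting the original, makes it easy to go back if a step turns out to be wrong.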
select() to remove columns
We previously saw how to use select() to get a subset of columns we want, but we can also use it to "remove" columns. Note how our last call made a permanent change to microbes. To exclude the variable compound_salinity from microbes, we can use select() with a - (minus) in front of compound_salinity, then overwrite microbes with the result.
# Check the column names before and after removing `compound_salinity`
colnames(microbes)
microbes <- select(microbes, -compound_salinity) # remove column compound_salinity
head(microbes)
| abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | ASV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATGCAGAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTCTGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTTAAGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCCACCTTTGACATGGGACGTATGGGAAGCAGAGATGTTTTCCTTCAGTTCGGCTGGCGTCCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTCAGTTGCCATCATTCAGTTGGGCACTCTGAAGGAACTGCCGGTGACAAGCCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACAGGTGGGGCTACACACGTGCTACAATGGCGACTACAGAGGGGAGCTACCTCGCGAGAGGGCGCCAATCTCAAAAAGTCGTCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATCGGAATGACCGGTCCGTAACGGGGCCTTCCCTACGGGGCATTCCAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTAGTAGCCAGCAGTTCGGCTGGGCACTCTGGGGAGACTGCCAGGGATAACCTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGCGTAAACAAAGGGAAGCGAAGGAGTGATCCGGAGCAAATCTCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTTTGACAATCCTAGAGATAGGACTTTCCCTTCGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCAGCATTCAGTTGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGATGGTACAACGAGCAGCAAGACCGCGAGGTCAAGCGAATCTCTTAAAGCCATTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGCCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGTCTTGTACA |
| 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAATGCATCTGACGTATTCGGAAACGGATATTCCCTACGGGGCAGATGTGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCTCATCGTCAGTTACCATCAGGTAAAGCTGGGGACTCTGGCGAGACTGCCATCGTAAGATGCGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGGTGGTACAAAGGGCAGCTACCTGGCGACAGGATGCTAATCTCCAAAACCACTCTCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCACGGCGCGGTGAATACGTTCCCGGGCCTTGTACA |
| 3.90 | compound-free | brackish | control | 1 | Bacteria | Actinobacteria | Actinobacteria | Propionibacteriales | Propionibacteriaceae | Tessaracoccus | GAATTGACGGGGCCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATATGCCGGAAACATCTAGAGATAGGTGCCCCTTTATGGTCGGTTTACAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTCCTATGTTGCCAGCGGGTAATGCCGGGGACTCATAGGAGACCGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTTCACGCATGCTACAATGGCCGGTACAAAGAGCTGCGAACCTGCAAGGGTGAGCGAATCTCAAAAAGCCGGTCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGGCTTGTACA |
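The minus sign also works for more than one column at a time. The sketch below uses a made-up tibble (`toy` and its columns are assumptions) to show removing one column, removing several with c(), and that the original object only changes if you overwrite it.

```r
library(dplyr)

# Hypothetical toy data with two helper columns we want to drop
toy <- tibble(genus             = c("A", "B"),
              abundance         = c(1, 2),
              compound_salinity = c("toluene_fresh", "pyrene_saline"),
              family_genus      = c("F_A", "G_B"))

# Drop a single column
one_removed <- toy %>% select(-compound_salinity)
colnames(one_removed)   # "genus" "abundance" "family_genus"

# Drop several columns at once by wrapping them in c()
two_removed <- toy %>% select(-c(compound_salinity, family_genus))
colnames(two_removed)   # "genus" "abundance"

# `toy` itself still has all four columns because we never overwrote it
ncol(toy)               # 4
```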
transmute() to create a new data.frame
transmute() will also create new variables, but it drops the existing ones, leaving a single column for each new variable. The output of transmute() is a tibble containing only your new variable(s).
microbes %>%
# Transmute some new columns
transmute(compound_salinity = paste(compound, salinity, sep ="_"),
family_genus = paste(family, genus, sep="_")) %>%
# Take a peek
head()
| compound_salinity | family_genus |
|---|---|
| <chr> | <chr> |
| compound-free_brackish | Pseudomonadaceae_Pseudomonas |
| compound-free_brackish | Rhodospirillaceae_Candidatus |
| compound-free_brackish | Lachnospiraceae_Lachnoclostridium |
| compound-free_brackish | Carnobacteriaceae_Trichococcus |
| compound-free_brackish | Porphyromonadaceae_Proteiniphilum |
| compound-free_brackish | Propionibacteriaceae_Tessaracoccus |
It is up to you whether you want to keep your data in a data.frame or switch to a vector when you are dealing with a single variable. dplyr verbs such as select() keep your data in a data.frame, even when the result is a single column. Base R extraction such as microbes$genus returns a vector instead.
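Here is a short sketch of the two options on a made-up tibble (`toy` is hypothetical). It also uses dplyr's pull(), which is the pipe-friendly way to step out of a data.frame into a vector.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy <- tibble(genus     = c("A", "B", "A"),
              abundance = c(1, 2, 3))

# select() keeps a one-column tibble
as_table <- toy %>% select(genus)
is.data.frame(as_table)   # TRUE

# pull() and $ both return a plain character vector
as_vector   <- toy %>% pull(genus)
also_vector <- toy$genus
is.character(as_vector)               # TRUE
identical(as_vector, also_vector)     # TRUE

# unique() works on both, but the shape of the result differs
unique(as_table)    # a 2-row tibble
unique(as_vector)   # "A" "B"
```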
What is the relative sequence abundance per genus per sample (where sample is a new variable made of salinity, compound, and group)? If we look at the data closely, each trio (salinity/compound/group) tends to have 3 replicates, so what we really want is the mean() value of abundance across replicates for each genus in a sample.
E.g. how abundant is Pseudomonas on average in each sample? Make sure the final object shows only genus, sample, and a new column called mean.
# Pass along the dataframe
microbes %>%
# Make a new column by combining salinity, compound, and group
mutate(sample = paste(salinity, compound, group, sep = "_")) %>%
# Group based on the genus and that new column, sample
group_by(genus, sample) %>%
# Get a summary of the abundance
summarise(mean = mean(abundance, na.rm = TRUE)) %>%
# Sort by the mean in descending order
arrange(desc(mean)) %>%
# Take a peek
head()
`summarise()` has grouped output by 'genus'. You can override using the `.groups` argument.
| genus | sample | mean |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Pseudomonas | brackish_compound-free_control | 25.59000 |
| Methanosaeta | brackish_pyrene_treatment | 19.24000 |
| Lachnoclostridium | brackish_toluene_treatment | 17.86000 |
| Pseudomonas | saline_pyrene_treatment | 17.35000 |
| Methanosaeta | fresh_compound-free_control | 16.79333 |
| Methanosaeta | fresh_toluene_treatment | 14.23667 |
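The message printed above the table comes from summarise(): when you group by more than one variable, it drops only the last grouping level and tells you so. The sketch below, on a made-up tibble (`toy` and its values are assumptions), compares the default behaviour with `.groups = "drop"`, which returns an ungrouped result and silences the message.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy <- tibble(genus     = c("A", "A", "B", "B"),
              sample    = c("s1", "s1", "s1", "s2"),
              abundance = c(1, 3, 5, 7))

# Default: the result stays grouped by genus and a message is printed
default_out <- toy %>%
    group_by(genus, sample) %>%
    summarise(mean = mean(abundance))
group_vars(default_out)   # "genus"

# .groups = "drop": no grouping left, no message
dropped <- toy %>%
    group_by(genus, sample) %>%
    summarise(mean = mean(abundance), .groups = "drop")
group_vars(dropped)       # character(0)
dropped$mean              # 2 5 7
```

Dropping the groups explicitly is a reasonable habit when later steps should treat the summary as one plain table.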
# Pass along the dataframe
microbes %>%
    # Filter for only the family Pseudomonadaceae
    filter(... == "Pseudomonadaceae") %>%
    # Make a new column by combining salinity, compound, and group
    mutate(sample = paste(salinity, compound, group, sep = "_")) %>%
    # Group based on the family and that new column, sample
    group_by(family, sample) %>%
    # Get a summary of the abundance
    summarise(mean = mean(abundance))
You've gone through all that trouble of learning how to import, filter, slice, and sort our datasets. Now comes the time to make sure that work doesn't go to waste. During larger scripts, there may be intermediate files you want to save just in case an error occurs further along. It can also give you a sense of how things are progressing. Whether it is an intermediate or final dataset that you would like to keep, it's time to learn how to save your files.
write_csv()
We're ready to write microbes, or any other data frame for that matter, to file. In this case we won't overwrite our old data set but rather just create a second version of it.
Note that there are many ways to write data frames to files, including writing back to Excel files! First we'll keep it simple and stay within the tidyverse with write_csv(), which is a derivative of the write_delim() function. The write_csv() function includes some of the following parameters:
x: the data structure you'd like to write to file - preferably a tibble or data.frame.
file: the file path where you are sending the output.
na: a character string used for NA values - defaults to "NA".
append: logical argument with FALSE as default (overwrites an existing file); TRUE will append to an existing file. If the file doesn't exist in either case, it writes to a new file.
col_names: logical argument to include the column names as part of the file. If unspecified, it will take the opposite value of append.
getwd()
# Write our data to file
...(x = microbes,
file = "data/microbes2.csv",
col_names=TRUE)
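To see how append, col_names, and na interact, here is a minimal sketch. It assumes the readr package is installed and uses a small made-up data frame (toy) and a temporary file rather than the course's microbes data, so nothing in your data folder is touched.

```r
library(readr)

# Hypothetical toy data (not the course's microbes dataset)
toy <- data.frame(genus = c("Pseudomonas", "Methanosaeta"),
                  abundance = c(17.35, NA))

# Write to a temporary file so that no existing file is overwritten
out_file <- tempfile(fileext = ".csv")

# First write: append is FALSE by default, so col_names defaults to TRUE
write_csv(x = toy, file = out_file, na = "missing")

# Second write: append = TRUE, so col_names defaults to FALSE (no second header)
write_csv(x = toy, file = out_file, na = "missing", append = TRUE)

# One header line followed by four data lines; NA is written as "missing"
written_lines <- readLines(out_file)
written_lines
```

Because col_names takes the opposite value of append, the appended rows do not repeat the header line.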
Use %>% to direct your output to write_csv()
That's right, you can pipe your data from filtering etc., over to write_csv(). While you may think this is usually the last step in your pipeline, it will actually write the data to file and then pass the input forward through the next pipe.
This has two implications:
Let's revisit our last summarizing pipeline.
write_result <-
# Pass along the dataframe
microbes %>%
# Filter for only the family Pseudomonadaceae
filter(family == "Pseudomonadaceae") %>%
# Make a new column by combining salinity, compound, and group
mutate(sample = paste(salinity, compound, group, sep = "_")) %>%
# Group based on the family and that new column, sample
group_by(family, sample) %>%
# Get a summary of the abundance
summarise(mean = mean(abundance)) %>%
# write your file to output
write_csv(x = ., file="data/microbe_summary.csv", col_names=TRUE)
# Take a look at the result of the pipeline
write_result
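Since write_csv() hands its input on to the next step, the pipeline does not have to end there. The following is a small sketch of that behaviour, assuming dplyr and readr are installed; the toy tibble and the temporary output file are made up for illustration.

```r
library(dplyr)
library(readr)

# Hypothetical toy data standing in for a real dataset
toy <- tibble(family = c("A", "A", "B"),
              abundance = c(1, 3, 10))

# Temporary output file so nothing in data/ is touched
out_file <- tempfile(fileext = ".csv")

toy_means <-
  toy %>%
  group_by(family) %>%
  summarise(mean = mean(abundance)) %>%
  # Writes the full summary to file, then returns its input invisibly
  write_csv(file = out_file) %>%
  # The pipeline keeps going with the same summary table
  filter(mean > 5)

# Only the filtered rows were assigned
toy_means

# The file holds the summary as it was before the final filter()
file_contents <- read.csv(out_file)
file_contents
```

The file on disk contains every family, while toy_means holds only the rows that survive the last filter() step.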
write_xlsx()
Sometimes you may want to write multiple data frames to a single file, such as an xlsx file with multiple sheets. This can be a convenient way to keep data together rather than making multiple write_csv() calls.
The writexl package contains the write_xlsx() function which can write the contents of a named list of data frames to multiple sheets. This function includes the following parameters:
x: a data.frame, tibble, or a named list of data frames
path: the path to write the .xlsx file to
col_names: logical parameter for whether or not to write column names at the top of each sheet
Let's give it a try to wrap up today's lecture!
# install.packages("writexl", dependencies = TRUE)
# library(writexl)
# Write a list to a single xlsx file
...(x = list("microbes_1" = microbes, "microbes_2" = microbes),
path = "data/microbes.xlsx",
col_names = TRUE
)
That's a wrap for our second class on R! You've made it through and we've learned about the following:
dplyr package.
Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters from the Data Manipulation with dplyr course: Transforming data with dplyr (900 points); Aggregating data (1050 points); and Selecting and transforming data (900 points) for a total of 2850 points. This is a pass-fail assignment, and in order to pass you need to achieve at least 2138 points (75% of the total possible points). Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course and you'll want to expand each section. It should look something like this:
You may need to take several screenshots if you cannot print it all in a single try. Submit the file(s) or a combined PDF for the homework to the assignment section of Quercus. By submitting your scores for each section, and chapter, we can keep track of your progress, identify knowledge gaps, and produce a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, September 30th to submit your assignment (right before the next lecture).
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.
https://googlesheets4.tidyverse.org/
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Syntax.html
http://stat545.com/block009_dplyr-intro.html
http://stat545.com/block010_dplyr-end-single-table.html
http://stat545.com/bit001_dplyr-cheatsheet.html
You may find for one reason or another that you prefer to use the base commands of R to import data. Here you'll find a quick primer on using the read.csv() function.
read.csv()
Let's read our microbes.csv data file into R. While we do these exercises, we are going to become friends with the help() function. Let's start by using the read.csv() function, which is actually a simplified version of the function read.table(). Both of these functions are part of the utils package in R, which is loaded automatically. The read.csv() function includes, but is not limited to, the following parameters:
file: the file name we want to import
header: logical parameter noting if your imported table has a header or not. Uses TRUE as the default value.
sep: character parameter denoting how your fields are separated. Uses , as the default value.
library(tidyverse)
# Remember the head() function? We'll import our file but just look at the first 6 rows of it
head(read.csv("data/microbes.csv"))
# Note that unlike read_csv() the result here is strictly a dataframe
str(read.csv("data/microbes.csv"))
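The header and sep parameters are easiest to understand by changing them. This sketch uses only base R and a made-up, semicolon-separated temporary file (not the course's microbes.csv) to show what each one does.

```r
# Create a small semicolon-separated file in a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("genus;abundance",
             "Pseudomonas;17.35",
             "Methanosaeta;16.79"),
           tmp)

# sep tells read.csv() how the fields are separated;
# header = TRUE uses the first line as the column names
with_header <- read.csv(tmp, header = TRUE, sep = ";")

# header = FALSE treats the first line as data and names the columns V1, V2
no_header <- read.csv(tmp, header = FALSE, sep = ";")

str(with_header)
str(no_header)
```

Note that with header = FALSE the abundance column is read as character, because the text "abundance" is now one of its values.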
NA values
In addition to the functions we discussed in class there are some additional methods for dealing with NA values that can be helpful, depending on the structure of your data.
# Set up our data structures again
na_vector <- c(5, 6, NA, 7, 7, NA)
na_vector
# A data.frame with NA values
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
Site2 = c(geneA = 15, geneB = NA, geneC = 27, geneD = 28),
Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = NA))
counts
na.omit() function will remove NA entries
In addition to our combination of functions from class, the na.omit() function returns an object from which the NA values have been deleted in a listwise manner. For a data.frame, this means that every row containing at least one NA is removed, leaving only the complete cases. You can also use it on a vector, where it simply drops the NA elements.
# Roughly equivalent to our previous, more complex code using is.na() and which() in combination
na.omit(na_vector)
# But under the hood it is doing something slightly different
# see how it works on data.frames?
na.omit(counts)
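# Apply the log function to only the non-NA observations in each row. In this case na.omit() can be useful.
# ?na.omit
# Note: na.omit(log) would simply return the log function unchanged, so na.omit() must be applied to the data instead
apply(counts, MARGIN = 1, function(x) log(na.omit(x)))
# Rows differ in length once their NA values are dropped, so apply() returns a list with one element per gene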
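As a sketch of what listwise deletion means in practice, the example below rebuilds the counts data.frame from above and compares na.omit() with two base R alternatives, complete.cases() and the na.rm argument. Which one is appropriate depends on whether you want to drop whole rows or only ignore the missing values in a calculation.

```r
# Same data.frame as above, repeated so this example is self-contained
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
                     Site2 = c(geneA = 15, geneB = NA, geneC = 27, geneD = 28),
                     Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = NA))

# na.omit() drops every row that contains at least one NA
complete_rows <- na.omit(counts)
rownames(complete_rows)

# complete.cases() returns a logical vector marking the rows without any NA,
# which can be used to select the same rows
same_rows <- counts[complete.cases(counts), ]

# To ignore NA values without dropping whole rows, many functions accept na.rm = TRUE
site_means <- colMeans(counts, na.rm = TRUE)
site_means
```

Here geneB and geneD are removed by na.omit() even though each has only one missing value, whereas colMeans() with na.rm = TRUE still uses their remaining observations.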
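You can similarly deal with NaN's in R. The is.na() function returns TRUE for both NA (not available) and NaN (not a number), whereas is.nan() returns TRUE only for NaN; in that sense NaN's are NAs, but NAs are not NaN's. NaN's result from undefined numeric operations such as 0/0 or sqrt(-1). Some packages may output NAs, NaN's, or Inf/-Inf; is.finite() returns FALSE for all of these, and is.infinite() identifies Inf and -Inf specifically.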
na_vector <- c(5, 6, NA, 7, 7, NA)
nan_vector <- c(5, 6, NaN, 7, 7, 0/0)
is.na(na_vector)
is.na(nan_vector)
is.nan(na_vector)
is.nan(nan_vector)
# These types of operations are very useful when working with conditional statements (if else, while, etc.).
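To illustrate how these checks can feed into conditional statements, here is a short base R sketch. The values vector and the describe_value() helper are made up for this example; the order of the checks matters because is.na() is also TRUE for NaN.

```r
# A vector mixing a regular number with NA, NaN, and infinite values
values <- c(5, NA, NaN, Inf, -Inf)

is.na(values)       # TRUE for both NA and NaN
is.nan(values)      # TRUE only for NaN
is.finite(values)   # TRUE only for regular numbers
is.infinite(values) # TRUE only for Inf and -Inf

# Hypothetical helper: label a single value using if/else
describe_value <- function(v) {
  if (is.nan(v)) {
    "NaN"        # checked first because is.na(NaN) is also TRUE
  } else if (is.na(v)) {
    "NA"
  } else if (is.infinite(v)) {
    "infinite"
  } else {
    "finite"
  }
}

# Apply the helper to every element and collect the labels
labels <- vapply(values, describe_value, character(1))
labels

# Keep only the finite values before calculating a summary
finite_mean <- mean(values[is.finite(values)])
finite_mean
```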